best practice on too many files vs IO overhead

best practice on too many files vs IO overhead

Istvan Soos
Hi,

I have a requirement that involves frequent, batched updates of my Lucene
index. This is done with an in-memory queue and a process that periodically
wakes up and drains that queue into the Lucene index.
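
Roughly, the setup looks like this (a simplified sketch, not my actual
code; the class name, queue type, and schedule are just illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    // Illustrative only: pending documents are queued, and a scheduled
    // task drains them into the index in batches.
    public class BatchIndexer {
        private final BlockingQueue<Document> queue =
            new LinkedBlockingQueue<Document>();
        private final IndexWriter writer; // opened elsewhere, kept open

        public BatchIndexer(IndexWriter writer) {
            this.writer = writer;
        }

        public void enqueue(Document doc) {
            queue.add(doc);
        }

        public void start() {
            ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleWithFixedDelay(new Runnable() {
                public void run() {
                    try {
                        List<Document> batch = new ArrayList<Document>();
                        queue.drainTo(batch);      // everything queued so far
                        for (Document doc : batch) {
                            writer.addDocument(doc);
                        }
                        writer.commit();           // make the batch searchable
                    } catch (Exception e) {
                        e.printStackTrace();       // real code would log this
                    }
                }
            }, 10, 10, TimeUnit.SECONDS);          // e.g. 6 times per minute
        }
    }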

If I do not optimize the index, I get a "too many open files" exception
(yes, I can raise the OS's file-descriptor limit a bit, but that only
postpones the exception).
If I do optimize the index, I incur very large IO overhead (since
optimizing rereads and rewrites the whole index).

Right now I'm optimizing the index on each batch cycle, but as my index
quickly grows to around 1 GB, the IO overhead becomes significant. The
updates need to happen frequently (1-10 times per minute), so I'm looking
for advice on how to solve this issue. I could split the index, but then
I'd hit "too many open files" even sooner, and in the end the IO overhead
would remain...

Any suggestions?
Thanks,
   Istvan


Re: best practice on too many files vs IO overhead

Michael McCandless
Are you sure you're closing all readers that you're opening?

It's surprising that you'd run out of descriptors with normal usage of
Lucene and its default mergeFactor (have you increased the mergeFactor?)

You can also enable the compound file format, which uses far fewer file
descriptors, at some cost to indexing performance.
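
Something like this, e.g. (a sketch against the 2.9-era API; the index
path and analyzer choice are placeholders):

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class WriterSettings {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(new File("/path/to/index"));
            IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_29),
                IndexWriter.MaxFieldLength.UNLIMITED);

            // Compound file format packs each segment into a single .cfs
            // file, drastically cutting the number of open file descriptors.
            writer.setUseCompoundFile(true);

            // mergeFactor controls how many segments accumulate before a
            // merge; the default is 10. Lower means fewer files, more merging.
            writer.setMergeFactor(10);

            writer.close();
        }
    }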

Also, a partial optimize (i.e. optimize(N)) does less IO but still
substantially reduces the index's segment count.
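
For example, continuing the sketch above (N = 10 is an arbitrary choice):

    // Merge down to at most N segments instead of 1: far less IO than a
    // full optimize(), but still caps the number of segment files.
    writer.optimize(10);   // N = 10 here; tune N against your descriptor limit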

Mike


Re: best practice on too many files vs IO overhead

Istvan Soos
On Fri, Nov 27, 2009 at 11:37 AM, Michael McCandless
<[hidden email]> wrote:
> Are you sure you're closing all readers that you're opening?

Absolutely. :) (Okay, one should never say that, but I've had bugs from
exactly this before, so I'm fairly sure that part is ok.)

> It's surprising that you'd run out of descriptors with normal usage of
> Lucene and its default mergeFactor (have you increased the mergeFactor?)

Default merge factor. (On Mac, the default maxfiles is 256; however, I've
run out of descriptors even at 10240 when I didn't call optimize.)

> You can also enable the compound file format, which uses far fewer file
> descriptors, at some cost to indexing performance.

I thought this was the default, but I'll check...

> Also, a partial optimize (i.e. optimize(N)) does less IO but still
> substantially reduces the index's segment count.

I wasn't aware of this, thanks, I'll try it!

Regards,
   Istvan


Re: best practice on too many files vs IO overhead

Michael McCandless
If in fact you are using CFS (it is the default), and your OS is
letting you use 10240 descriptors, and you haven't changed the
mergeFactor, then something is seriously wrong.  I would triple check
that all readers are being closed.

Or... if you list the index directory, how many files do you see?
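
E.g. a quick count from Java (placeholder path):

    import java.io.File;

    public class CountIndexFiles {
        public static void main(String[] args) {
            File indexDir = new File("/path/to/index"); // placeholder path
            System.out.println(indexDir.list().length
                + " files in " + indexDir);
        }
    }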

Mike


Re: best practice on too many files vs IO overhead

Istvan Soos
You were right, my bad...

I close readers asynchronously on a scheduled basis (after the writer
refreshes the index, so as not to interrupt ongoing searches), but while
I had set up that schedule for my first two indexes, I forgot it for the
third... oh dear...
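
For the record, the missing step looks roughly like this (a simplified
sketch; in my real code the close is deferred until in-flight searches
have finished):

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;

    // Simplified: swap in a reopened reader, then close the one it replaced.
    private volatile IndexReader reader;

    private synchronized void refreshReader() throws IOException {
        IndexReader newReader = reader.reopen(); // same instance if unchanged
        if (newReader != reader) {
            IndexReader old = reader;
            reader = newReader; // searches from now on see the fresh reader
            old.close();        // the step I forgot: skipping it leaks descriptors
        }
    }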

Thanks anyway for the info, it was useful indeed.
Regards,
   Istvan


Re: best practice on too many files vs IO overhead

Michael McCandless
Phew :)  Thanks for bringing closure!

Mike
