Dynamically varying maxBufferedDocs


Chuck Williams-2
Hi All,

Does anybody have experience dynamically varying maxBufferedDocs?  In my
app I can never truncate docs, so I work with maxFieldLength set to
Integer.MAX_VALUE.  Some documents are large, over 100 MB; most
documents are tiny.  So any fixed value of maxBufferedDocs small enough
to avoid OOMs is too small for good ongoing performance.

It appears to me that the merging code will work fine if the initial
segment sizes vary.  E.g., a simple solution is to make
IndexWriter.flushRamSegments() public and manage this externally (for
which I already have all the needed apparatus, including size
information, the necessary thread synchronization, etc.).
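
Concretely, the external management would look something like this (a
sketch only: the public flushRamSegments() is the proposed change, and
the byte estimate comes from my own size apparatus):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexWriter;

    /**
     * Sketch only: assumes flushRamSegments() has been made public, and
     * that the application supplies a size estimate for each document.
     */
    public class SizeBoundedWriter {
        private final IndexWriter writer;
        private final long maxBufferedBytes;
        private long bufferedBytes = 0;

        public SizeBoundedWriter(IndexWriter writer, long maxBufferedBytes) {
            this.writer = writer;
            this.maxBufferedBytes = maxBufferedBytes;
        }

        public synchronized void addDocument(Document doc, long estimatedBytes)
                throws IOException {
            writer.addDocument(doc);
            bufferedBytes += estimatedBytes;
            if (bufferedBytes >= maxBufferedBytes) {
                writer.flushRamSegments();  // flush early to cap RAM usage
                bufferedBytes = 0;
            }
        }
    }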

A better solution might be to build a size-management option into the
maxBufferedDocs mechanism in Lucene, but at least for my purposes that
doesn't appear necessary as a first step.

My main concern is that the mergeFactor escalation merging logic will
somehow behave poorly in the presence of dynamically varying initial
segment sizes.

I'm going to try this now, but am wondering if anybody has tried things
along these lines and might offer useful suggestions or admonitions.

Thanks for any advice,

Chuck



Re: Dynamically varying maxBufferedDocs

Yonik Seeley-2
On 11/9/06, Chuck Williams <[hidden email]> wrote:
> My main concern is that the mergeFactor escalation merging logic will
> somehow behave poorly in the presence of dynamically varying initial
> segment sizes.

Things will work as expected with varying segment sizes, but *not*
with varying maxBufferedDocs.  The "level" of a segment is defined by
maxBufferedDocs.
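
Roughly (my paraphrase, not the actual merge code):

    // A segment's level reflects how many mergeFactor-way merges produced
    // it, so docCount is roughly maxBufferedDocs * mergeFactor^level.
    static int approximateLevel(int docCount, int maxBufferedDocs,
                                int mergeFactor) {
        int level = 0;
        long levelSize = maxBufferedDocs;
        while (docCount > levelSize) {
            levelSize *= mergeFactor;
            level++;
        }
        return level;
    }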

If there were a way to flush early without changing maxBufferedDocs,
things would work fine.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: Dynamically varying maxBufferedDocs

Chuck Williams-2
Thanks Yonik!  Poor wording on my part.  I won't vary maxBufferedDocs;
I'm just making flushRamSegments() public and calling it externally
(properly synchronized), earlier than it would otherwise be called by
ongoing addDocument-driven merging.

Sounds like this should work.

Chuck


Yonik Seeley wrote on 11/09/2006 08:37 AM:

> On 11/9/06, Chuck Williams <[hidden email]> wrote:
>> My main concern is that the mergeFactor escalation merging logic will
>> somehow behave poorly in the presence of dynamically varying initial
>> segment sizes.
>
> Things will work as expected with varying segment sizes, but *not*
> with varying maxBufferedDocs.  The "level" of a segment is defined by
> maxBufferedDocs.
>
> If there were a way to flush early without changing maxBufferedDocs,
> things would work fine.
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search
> server



Re: Dynamically varying maxBufferedDocs

Yonik Seeley-2
On 11/9/06, Chuck Williams <[hidden email]> wrote:
> Thanks Yonik!  Poor wording on my part.  I won't vary maxBufferedDocs;
> I'm just making flushRamSegments() public and calling it externally
> (properly synchronized), earlier than it would otherwise be called by
> ongoing addDocument-driven merging.
>
> Sounds like this should work.

Yep.
For best behavior, you probably want to be using the current
(svn-trunk) version of Lucene with the new merge policy.  It ensures
there are mergeFactor segments with size <= maxBufferedDocs before
triggering a merge.  This makes for faster indexing in the presence of
deleted docs or partially full segments.
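
Paraphrasing the check (illustrative only, not the actual trunk code):

    // Merge the lowest level once mergeFactor segments of at most
    // maxBufferedDocs docs each have accumulated.
    static boolean shouldMergeLowestLevel(int[] segmentDocCounts,
                                          int maxBufferedDocs,
                                          int mergeFactor) {
        int lowestLevelCount = 0;
        for (int i = 0; i < segmentDocCounts.length; i++) {
            if (segmentDocCounts[i] <= maxBufferedDocs) {
                lowestLevelCount++;
            }
        }
        return lowestLevelCount >= mergeFactor;
    }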

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


Re: Dynamically varying maxBufferedDocs

Chuck Williams-2

Yonik Seeley wrote on 11/09/2006 08:50 AM:
> For best behavior, you probably want to be using the current
> (svn-trunk) version of Lucene with the new merge policy.  It ensures
> there are mergeFactor segments with size <= maxBufferedDocs before
> triggering a merge.  This makes for faster indexing in the presence of
> deleted docs or partially full segments.
>

Unfortunately I've got quite a few local patches, so it will take a while
to sync up.  If I don't already have this new logic, can I pick it up by
just merging with the latest IndexWriter, or are the changes more extensive?

Thanks again,

Chuck



Re: Dynamically varying maxBufferedDocs

Chuck Williams-2


Chuck Williams wrote on 11/09/2006 08:55 AM:

> Yonik Seeley wrote on 11/09/2006 08:50 AM:
>  
>> For best behavior, you probably want to be using the current
>> (svn-trunk) version of Lucene with the new merge policy.  It ensures
>> there are mergeFactor segments with size <= maxBufferedDocs before
>> triggering a merge.  This makes for faster indexing in the presence of
>> deleted docs or partially full segments.
>>
>>    
>
> I've got quite a few local patches unfortunately.  It will take a while
> to sync up.  If I don't already have this new logic, can I pick it up by
> just merging with the latest IndexWriter or are the changes more extensive?
>  
I must already have the new merge logic, as the only diff between my
IndexWriter and the latest svn is the change I just made to make
flushRamSegments public.

Yonik, thanks for your help.  This should work well!

Chuck



Re: Dynamically varying maxBufferedDocs

Michael Busch
I had the same problem with large documents causing memory issues.  I
solved it by introducing a new setting in IndexWriter,
setMaxBufferSize(long).  Now a merge is triggered either when
bufferedDocs == maxBufferedDocs *or* when the size of the buffered
docs >= maxBufferSize.  I made these changes on top of the new merge
policy Yonik mentioned, so if anyone is interested I could open a Jira
issue and submit a patch.
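
The trigger itself is simple (field names illustrative, not the exact
patch):

    // Inside IndexWriter:
    private int maxBufferedDocs = 10;              // existing doc-count bound
    private long maxBufferSize = 32 * 1024 * 1024; // new byte-size bound

    // Flush the buffered docs when either bound is reached.
    private boolean bufferIsFull(int bufferedDocs, long bufferedBytes) {
        return bufferedDocs >= maxBufferedDocs
            || bufferedBytes >= maxBufferSize;
    }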

- Michael


Yonik Seeley wrote:

> On 11/9/06, Chuck Williams <[hidden email]> wrote:
>> Thanks Yonik!  Poor wording on my part.  I won't vary maxBufferedDocs;
>> I'm just making flushRamSegments() public and calling it externally
>> (properly synchronized), earlier than it would otherwise be called by
>> ongoing addDocument-driven merging.
>>
>> Sounds like this should work.
>
> Yep.
> For best behavior, you probably want to be using the current
> (svn-trunk) version of Lucene with the new merge policy.  It ensures
> there are mergeFactor segments with size <= maxBufferedDocs before
> triggering a merge.  This makes for faster indexing in the presence of
> deleted docs or partially full segments.
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search
> server



Re: Dynamically varying maxBufferedDocs

Chuck Williams-2
This sounds good.  Michael, I'd love to see your patch,

Chuck


Michael Busch wrote on 11/09/2006 09:13 AM:

> I had the same problem with large documents causing memory issues.  I
> solved it by introducing a new setting in IndexWriter,
> setMaxBufferSize(long).  Now a merge is triggered either when
> bufferedDocs == maxBufferedDocs *or* when the size of the buffered
> docs >= maxBufferSize.  I made these changes on top of the new merge
> policy Yonik mentioned, so if anyone is interested I could open a Jira
> issue and submit a patch.
>
> - Michael
>
>
> Yonik Seeley wrote:
>> On 11/9/06, Chuck Williams <[hidden email]> wrote:
>>> Thanks Yonik!  Poor wording on my part.  I won't vary maxBufferedDocs;
>>> I'm just making flushRamSegments() public and calling it externally
>>> (properly synchronized), earlier than it would otherwise be called by
>>> ongoing addDocument-driven merging.
>>>
>>> Sounds like this should work.
>>
>> Yep.
>> For best behavior, you probably want to be using the current
>> (svn-trunk) version of Lucene with the new merge policy.  It ensures
>> there are mergeFactor segments with size <= maxBufferedDocs before
>> triggering a merge.  This makes for faster indexing in the presence of
>> deleted docs or partially full segments.
>>
>> -Yonik
>> http://incubator.apache.org/solr Solr, the open-source Lucene search
>> server



Re: Dynamically varying maxBufferedDocs

Michael Busch

> This sounds good.  Michael, I'd love to see your patch,
>
> Chuck

Ok, I'll probably need a few days before I can submit it (I have to
write unit tests and check that it compiles against the current head),
because I'm quite busy with other stuff right now.  But you will get it
soon :-)


Re: Dynamically varying maxBufferedDocs

Chuck Williams-2
Michael Busch wrote on 11/09/2006 09:56 AM:
>
>> This sounds good.  Michael, I'd love to see your patch,
>>
>> Chuck
>
> Ok, I'll probably need a few days before I can submit it (I have to
> write unit tests and check that it compiles against the current head),
> because I'm quite busy with other stuff right now.  But you will get it
> soon :-)

I've just written my patch and will submit it too once it is fully
tested.  I took this approach:

   1. Add sizeInBytes() to RAMDirectory
   2. Make flushRamSegments() plus new numRamDocs() and ramSizeInBytes()
      public in IndexWriter


This does not build the facility into IndexWriter, but it does provide a
nice API to manage this externally.  I didn't do it in IndexWriter for
two reasons:

   1. I use ParallelWriter, which has to manage this differently.
   2. There is no general mechanism in Lucene to size documents.  I have
      an interface on the readers in my reader-valued fields to support
      this.


In general, the application knows things that Lucene doesn't that help
to manage the size bounds.

Chuck

