[jira] Created: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents


[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents


     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: index.presharedstores.nocfs.zip
                index.presharedstores.cfs.zip

Oh, were the test failures only in TestBackwardsCompatibility?

Because I changed the index file format, I added 2 more ZIP files to
that unit test, but "svn diff" doesn't pick up new zip files.  So
I'm attaching them.  Can you drop these zip files into your
src/test/org/apache/lucene/index and test again?  Thanks.



> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: index.presharedstores.cfs.zip, index.presharedstores.nocfs.zip, LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch, LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch, LUCENE-843.take9.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, e.g. how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
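The flush-by-RAM behavior described above can be sketched with a simplified, hypothetical model; the class and method names below are illustrative only, not the real IndexWriter internals:

```java
// Simplified sketch: flush when estimated RAM usage crosses a threshold,
// rather than after a fixed number of documents. Hypothetical names.
import java.util.ArrayList;
import java.util.List;

public class RamBufferSketch {
    private final long ramBufferSizeBytes;
    private long bytesUsed = 0;
    private final List<String> buffered = new ArrayList<String>();
    private int flushCount = 0;

    public RamBufferSketch(long ramBufferSizeBytes) {
        this.ramBufferSizeBytes = ramBufferSizeBytes;
    }

    public void addDocument(String doc) {
        buffered.add(doc);
        bytesUsed += 2L * doc.length();  // crude estimate: 2 bytes per char
        if (bytesUsed >= ramBufferSizeBytes) {
            flush();
        }
    }

    private void flush() {
        // The real writer would build an on-disk segment here; we just reset.
        buffered.clear();
        bytesUsed = 0;
        flushCount++;
    }

    public int getFlushCount() { return flushCount; }
}
```

With a 100-byte threshold, ten 10-character documents trigger two flushes; the trigger depends on document size, not document count.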

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents


    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506609 ]

Steven Parkes commented on LUCENE-843:
--------------------------------------

Yeah, that was it.

I'll be delving more into the code as I try to figure out how it will dovetail with the merge policy refactoring.



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents


    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506718 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

> Yeah, that was it.

Phew!

> I'll be delving more into the code as I try to figure out how it will
> dovetail with the merge policy refactoring.

OK, thanks.  I am very eager to get some other eyeballs looking for
issues with this patch!

I *think* this patch and the merge policy refactoring should be fairly
separate.

With this patch, "flushing" RAM -> Lucene segment is no longer a
"mergeSegments" call, which I think simplifies IndexWriter.  Previously
mergeSegments had lots of extra logic to tell whether it was merging RAM
segments (= a flush) or merging "real" segments; now it's simpler
because mergeSegments really only merges segments.
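A rough sketch of that separation (hypothetical names, not the actual IndexWriter code): flushing appends a brand-new segment, while merging only ever combines existing segments.

```java
// Illustrative model: the RAM buffer and the segment list never mix.
import java.util.ArrayList;
import java.util.List;

public class SegmentListSketch {
    final List<Integer> segmentDocCounts = new ArrayList<Integer>();

    // Flush: buffered docs become a brand-new segment; no merge logic involved.
    void flushRamToSegment(int bufferedDocs) {
        segmentDocCounts.add(bufferedDocs);
    }

    // Merge: combines two real segments (j > i); never sees the RAM buffer.
    void mergeSegments(int i, int j) {
        int merged = segmentDocCounts.get(i) + segmentDocCounts.get(j);
        segmentDocCounts.remove(j);
        segmentDocCounts.remove(i);
        segmentDocCounts.add(merged);
    }
}
```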




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents


    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506752 ]

Michael Busch commented on LUCENE-843:
--------------------------------------

Hi Mike,

my first comment on this patch is: Impressive!

It's also quite overwhelming at first, but I'm trying to dig into it. I'll probably have more questions; here's the first one:

Does DocumentsWriter also solve the problem DocumentWriter had before LUCENE-880? I believe the answer is yes. Even though you close the TokenStreams in the finally clause of invertField(), just as DocumentWriter did before 880, this is safe because addPosition() serializes the term strings and payload bytes into the posting hash table right away. Is that right?



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents


    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506778 ]

Michael Busch commented on LUCENE-843:
--------------------------------------

Mike,

the benchmarks you ran focus on measuring pure indexing performance. I think it would be interesting to know how big the speedup is in real-life scenarios, i.e. with StandardAnalyzer and maybe even HTML parsing. The speedup will surely be smaller, but it should still be a significant improvement. Did you run those kinds of benchmarks already?



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents


    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506811 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

> Does DocumentsWriter also solve the problem DocumentWriter had
> before LUCENE-880? I believe the answer is yes. Even though you
> close the TokenStreams in the finally clause of invertField() like
> DocumentWriter did before 880 this is safe, because addPosition()
> serializes the term strings and payload bytes into the posting hash
> table right away. Is that right?

That's right.  When I merged in the fix for LUCENE-880, I realized
that with this patch it's fine to close the token stream immediately
after processing all of its tokens, because everything about the token
stream has been "absorbed" into the postings hash.
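A minimal model of that "absorb" idea, assuming a simplified posting hash (hypothetical names, not DocumentsWriter's actual internals): each token's term text is copied into the writer's own storage at add time, so nothing from the token stream needs to stay alive after the stream is closed.

```java
// Simplified posting hash: the (reusable) token buffer is copied out
// immediately, so the token stream can be closed right away.
import java.util.HashMap;
import java.util.Map;

public class PostingHashSketch {
    private final Map<String, Integer> termFreqs = new HashMap<String, Integer>();

    public void addPosition(char[] termBuffer, int length) {
        // Copy the characters out of the token buffer now; no token object
        // or stream state is retained.
        String term = new String(termBuffer, 0, length);
        termFreqs.merge(term, 1, Integer::sum);
    }

    public int freq(String term) {
        return termFreqs.getOrDefault(term, 0);
    }
}
```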

> the benchmarks you run focus on measuring the pure indexing
> performance. I think it would be interesting to know how big the
> speedup is in real-life scenarios, i.e. with StandardAnalyzer and
> maybe even HTML parsing? For sure the speedup will be less, but it
> should still be a significant improvement. Did you run those kinds
> of benchmarks already?

Good question ... I haven't measured the performance cost of using
StandardAnalyzer or HTML parsing, but I will test & post back.



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents


    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506907 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

OK I ran tests comparing analyzer performance.

It's the same test framework as above, using the ~5,500-byte Europarl
docs with autoCommit=true, a 32 MB RAM buffer, no stored fields or
vectors, and CFS=false, indexing 200,000 documents.

The SimpleSpaceAnalyzer is my own whitespace analyzer that minimizes
GC cost by not allocating a Term or String for every token in every
document.

Each result is the best time of 2 runs:

  ANALYZER            PATCH (sec) TRUNK (sec)  SPEEDUP
  SimpleSpaceAnalyzer  79.0       326.5        4.1 X
  StandardAnalyzer    449.0       674.1        1.5 X
  WhitespaceAnalyzer  104.0       338.9        3.3 X
  SimpleAnalyzer      104.7       328.0        3.1 X

StandardAnalyzer is definitely rather time-consuming!




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents


    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506961 ]

Michael Busch commented on LUCENE-843:
--------------------------------------

> OK I ran tests comparing analyzer performance.

Thanks for the numbers, Mike. Yes, the gain is smaller with StandardAnalyzer,
but 1.5X faster is still very good!


I have a question about the extensibility of your code. For flexible
indexing we want to be able, in the future, to implement different posting
formats, and we might even want to allow our users to implement their own
posting formats.

When I implemented multi-level skipping I tried to keep this in mind.
Therefore I put most of the functionality in the two abstract classes
MultiLevelSkipListReader/Writer. Subclasses implement the actual format
of the skip data. I think with this design it should be quite easy to
implement different formats in the future while limiting the code
complexity.
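The pattern described here can be sketched in a simplified form (these classes are hypothetical stand-ins, not Lucene's actual MultiLevelSkipListReader/Writer): the abstract class owns the bookkeeping of when skip entries are recorded, while the subclass defines only what an entry looks like.

```java
// Abstract writer owns the shared logic; subclass owns the concrete format.
import java.util.ArrayList;
import java.util.List;

abstract class SkipWriterSketch {
    private final int skipInterval;
    final List<String> output = new ArrayList<String>();

    SkipWriterSketch(int skipInterval) { this.skipInterval = skipInterval; }

    // Shared logic: decide *when* a skip entry is recorded.
    void onDocument(int docId) {
        if (docId % skipInterval == 0) {
            output.add(encodeSkipEntry(docId));
        }
    }

    // Format-specific logic: decide *what* a skip entry looks like.
    abstract String encodeSkipEntry(int docId);
}

class VIntSkipWriter extends SkipWriterSketch {
    VIntSkipWriter(int skipInterval) { super(skipInterval); }

    @Override
    String encodeSkipEntry(int docId) {
        return "vint:" + docId;  // stand-in for a real variable-length encoding
    }
}
```

A different on-disk format would only need another small subclass; the interval logic stays untouched.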

With the old DocumentWriter I think this would be quite simple to do, too,
by adding a class like PostingListWriter whose subclasses define the actual
format (because DocumentWriter is so simple).

Do you think your code is easily extensible in this regard? I'm
wondering because of all the optimizations you're doing, e.g.
sharing byte arrays. But I'm certainly not familiar enough with your code
yet, so I'm only guessing here.




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents


    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506974 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

> Do you think your code is easily extensible in this regard? I'm
> wondering because of all the optimizations you're doing, e.g.
> sharing byte arrays. But I'm certainly not familiar enough with your code
> yet, so I'm only guessing here.

Good question!

DocumentsWriter is definitely more complex than DocumentWriter was, but
that doesn't prevent extensibility, and I think it will work very well
when we do flexible indexing.

The patch now has dedicated methods for writing into the freq/prox/etc.
streams ('writeFreqByte', 'writeFreqVInt', 'writeProxByte',
'writeProxVInt', etc.), but this could easily be changed to use true
IndexOutput streams instead.  That would hide all details of
shared byte arrays from whoever is doing the writing.

The way I roughly see flexible indexing working in the future is
DocumentsWriter will be responsible for keeping track of unique terms
seen (in its hash table), holding the Posting instance (which could be
subclassed in the future) for each term, flushing a real segment when
full, handling shared byte arrays, etc.  Ie all the "infrastructure".

But then the specific logic of what bytes are written into which
streams (freq/prox/vectors/others) will be handled by a separate class
or classes that we can plug/unplug according to some "schema".
DocumentsWriter would call on these classes, providing the
IndexOutputs for all of a Posting's streams, per position, and these
classes would write their own format into the IndexOutputs.

I think a separation like that would work well: we could have good
performance and also extensibility.  The devil is in the details, of
course...

I obviously haven't factored DocumentsWriter in this way (it has its
own addPosition that writes the current Lucene index format) but I
think this is very doable in the future.
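That proposed separation could be sketched roughly as follows. All names here are hypothetical, not an eventual Lucene API, and ByteArrayOutputStream stands in for IndexOutput: the infrastructure owns the streams, while a pluggable class decides which bytes go into which stream per position.

```java
// Pluggable per-position writer: the infrastructure supplies the streams,
// the codec-like class chooses what bytes to write into each one.
import java.io.ByteArrayOutputStream;

interface PostingCodecSketch {
    void writePosition(ByteArrayOutputStream freqOut,
                       ByteArrayOutputStream proxOut,
                       int positionDelta);
}

class DefaultCodecSketch implements PostingCodecSketch {
    @Override
    public void writePosition(ByteArrayOutputStream freqOut,
                              ByteArrayOutputStream proxOut,
                              int positionDelta) {
        // Stand-ins for the real freq/prox encodings:
        freqOut.write(1);                     // a one-byte frequency marker
        proxOut.write(positionDelta & 0x7F);  // a one-byte position delta
    }
}
```

Swapping in a different posting format would mean supplying another PostingCodecSketch implementation, without touching the shared byte-array machinery.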




Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Grant Ingersoll-4
Hi Michael,

I know you've got your hands full, but I was wondering if you could
either post your benchmark code or, better yet, hook it into the
benchmarker contrib (it is quite easy).

Let me know if I can help,
Grant

On Jun 21, 2007, at 10:01 AM, Michael McCandless (JIRA) wrote:

>
>     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506907 ]
>
> Michael McCandless commented on LUCENE-843:
> -------------------------------------------
>
> OK I ran tests comparing analyzer performance.
>
> It's the same test framework as above, using the ~5,500 byte Europarl
> docs with autoCommit=true, 32 MB RAM buffer, no stored fields nor
> vectors, and CFS=false, indexing 200,000 documents.
>
> The SimpleSpaceAnalyzer is my own whitespace analyzer that minimizes
> GC cost by not allocating a Term or String for every token in every
> document.
>
> Each run is best time of 2 runs:
>
>   ANALYZER            PATCH (sec) TRUNK (sec)  SPEEDUP
>   SimpleSpaceAnalyzer  79.0       326.5        4.1 X
>   StandardAnalyzer    449.0       674.1        1.5 X
>   WhitespaceAnalyzer  104.0       338.9        3.3 X
>   SimpleAnalyzer      104.7       328.0        3.1 X
>
> StandardAnalyzer is definitely rather time-consuming!

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless-2

Hi Grant,

The benchmarking code I've been using is in all but the first & last
patches I attached on LUCENE-843.  Really it's just a modified version
of the demo IndexFiles code, plus a new analyzer (SimpleSpaceAnalyzer)
that is the same as WhitespaceAnalyzer except it re-uses Token/String
instead of allocating a new one for each term.

But, I'd also like to port these into the benchmark contrib framework.
My plan is to make a new DocMaker that knows how to read documents
"line by line" from a previously created file, so as not to pay the I/O
cost of opening a separate file per document, and then make a new class
(maybe a task?) that can read documents from a DocMaker and write a
single file with one document per line.
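A minimal self-contained sketch of that one-document-per-line idea (class and method names here are hypothetical, not the eventual contrib code): write the corpus once into a single line file, then stream the documents back with a single open instead of one open per doc.

```java
import java.io.*;
import java.util.*;

// Sketch only: one document per line in a single file, so indexing
// pays one file-open for the whole corpus instead of one per document.
public class LineDocReader {

    // Assumed one-time preprocessing step: flatten the corpus into a line file.
    static void writeLineFile(File f, List<String> docs) throws IOException {
        Writer w = new BufferedWriter(new FileWriter(f));
        for (String d : docs) { w.write(d); w.write('\n'); }
        w.close();
    }

    // Stream documents back sequentially; a single open() for everything.
    static List<String> readLineFile(File f) throws IOException {
        List<String> docs = new ArrayList<String>();
        BufferedReader r = new BufferedReader(new FileReader(f));
        for (String line; (line = r.readLine()) != null; ) docs.add(line);
        r.close();
        return docs;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("docs", ".txt");
        f.deleteOnExit();
        List<String> docs = Arrays.asList("doc one body", "doc two body");
        writeLineFile(f, docs);
        if (!readLineFile(f).equals(docs)) throw new AssertionError();
        System.out.println("round-trip ok");
    }
}
```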

I just haven't quite gotten to this yet, but I will :)

Mike

"Grant Ingersoll" <[hidden email]> wrote:

> Hi Michael,
>
> I know you've got your hands full, but was wondering if you could  
> either post your benchmark code, or better yet, hook it into the  
> benchmarker contrib (it is quite easy).
>
> Let me know if I can help,
> Grant

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507567 ]

Doron Cohen commented on LUCENE-843:
------------------------------------

Mike, I am considering testing the performance of this patch on a somewhat different use case, a real one I think. After indexing 25M docs of TREC .gov2 (~500GB of docs), I pushed the index terms into a spell-correction index using the contrib spell checker. Docs here are *very* short: for each index term a document is created, containing some n-grams. The specific machine I used has 2 CPUs, but the SpellChecker indexing does not take advantage of that. Anyhow, 126,684,685 words==documents were indexed.
For the docs adding step I had:
    mergeFactor = 100,000
    maxBufferedDocs = 10,000
So no merging took place.
This step took 21 hours, and created 12,685 segments, total size 15 - 20 GB.
Then I optimized the index with
    mergeFactor = 400
(Larger values were hard on the open files limits.)
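As a quick sanity check on those numbers (an illustration, not something from the thread): with merging effectively disabled, each flush of maxBufferedDocs buffered docs yields one on-disk segment, so the segment count is roughly numDocs / maxBufferedDocs:

```java
// Rough sanity check: with no merging, every flush of maxBufferedDocs
// buffered docs produces exactly one segment.
public class SegmentCount {
    static long segments(long numDocs, long maxBufferedDocs) {
        // ceiling division: a final partial buffer still flushes a segment
        return (numDocs + maxBufferedDocs - 1) / maxBufferedDocs;
    }
    public static void main(String[] args) {
        // 126,684,685 docs / 10,000 per flush -> 12,669 segments,
        // the same ballpark as the 12,685 reported above.
        System.out.println(segments(126684685L, 10000L));
    }
}
```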

I thought it would be interesting to see how the new code performs in this scenario, what do you think?

If you too find this comparison interesting, I have two more questions:
  - what settings do you recommend?
  - is there any chance for speed-up in optimize()?  I didn't read your
    new code yet, but at least from some comments here it seems that
    on disk merging was not changed... is this (still) so? I would skip the
    optimize part if this is not of interest for the comparison. (In fact I am
    still waiting for my optimize() to complete, but if it is not of interest I
    will just interrupt it...)

Thanks,
Doron


[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507587 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------


> I thought it would be interesting to see how the new code performs in this scenario, what do you think?

Yes I'd be very interested to see the results of this.  It's a
somewhat "unusual" indexing situation (such tiny docs) but it's a real
world test case.  Thanks!

>  - what settings do you recommend?

I think these are likely the important ones in this case:

  * Flush by RAM instead of doc count
    (writer.setRAMBufferSizeMB(...)).

  * Give it as much RAM as you can.

  * Use maybe 3 indexing threads (if you can).

  * Turn off compound file.

  * If you have stored fields/vectors (seems not in this case) use
    autoCommit=false.

  * Use a trivial analyzer that doesn't create new String/new Token
    (re-use the same Token, and use the char[] based term text
    storage instead of the String one).

  * Re-use Document/Field instances.  The DocumentsWriter is fine with
    this and it saves substantial time from GC especially because your
    docs are so tiny (per-doc overhead is otherwise a killer).  In
    IndexLineFiles I made a StringReader that lets me reset its String
    value; this way I didn't have to change the Field instances stored
    in the Document.
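The first recommendation can be pictured with a tiny self-contained sketch (illustrative only, not the actual DocumentsWriter code): the flush is triggered by estimated bytes buffered rather than by a fixed document count, so tiny docs no longer force many small flushes.

```java
// Sketch of the flush-by-RAM policy: accumulate each added document's
// estimated in-RAM footprint and flush when a byte budget is crossed,
// instead of flushing after a fixed number of buffered documents.
public class FlushByRam {
    private final long ramBudgetBytes;
    private long bytesUsed = 0;
    private int flushCount = 0;

    FlushByRam(double mb) { this.ramBudgetBytes = (long) (mb * 1024 * 1024); }

    void addDocument(long estimatedBytes) {
        bytesUsed += estimatedBytes;
        if (bytesUsed >= ramBudgetBytes) {
            flushCount++;      // stand-in for writing a real segment
            bytesUsed = 0;
        }
    }

    int getFlushCount() { return flushCount; }

    public static void main(String[] args) {
        FlushByRam w = new FlushByRam(1.0);                  // 1 MB budget
        for (int i = 0; i < 3000; i++) w.addDocument(1024);  // ~1 KB docs
        // 3000 KB into a 1024 KB budget -> 2 flushes, 952 KB still buffered
        System.out.println("flushes=" + w.getFlushCount());
    }
}
```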

>  - is there any chance for speed-up in optimize()?  I didn't read
>    your new code yet, but at least from some comments here it seems
>    that on disk merging was not changed... is this (still) so? I would

Correct: my patch doesn't touch merging and optimizing.  All it does
now is gather many docs in RAM and then flush a new segment when it's
time.  I've opened a separate issue (LUCENE-856) for optimizations
in segment merging.

[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507708 ]

Doron Cohen commented on LUCENE-843:
------------------------------------

Just to clarify your comment on reusing field and doc instances - to my understanding reusing a field instance is ok *only* after the containing doc was added to the index.

For a "fair" comparison I ended up not following most of your recommendations, including the reuse field/docs one and the non-compound one (apologies:-)), but I might use them later.

For the first 100,000,000 docs (==speller words) the speed-up is quite amazing:
    Orig:  Speller: added 100000000 words in 58490 seconds = 16 hours 14 minutes 50 seconds
    New:   Speller: added 100000000 words in 10912 seconds = 3 hours 1 minute 52 seconds
This is 5.3 times faster!!!

This btw was with maxBufDocs=100,000 (I forgot to set the MEM param).
I stopped the run now, I don't expect to learn anything new by letting it continue.

When trying with MEM=512MB, it at first seemed faster, but then there were occasional local slow-downs, and eventually it became a bit slower than the previous run. I know these are not merges, so they are either flushes (RAM directed) or GC activity. I will perhaps run with GC debug flags, and perhaps add a print at flush, so as to tell the culprit for these local slow-downs.

Other than that, I will perhaps try to index .GOV2 (25 million HTML docs) with this patch. The way I indexed it before, it took about 4 days, running in 4 threads and creating 36 indexes. This is even more of a real-life scenario: it involves HTML parsing, standard analysis, and merging (to some extent). Since there are 4 threads, each one will get, say, 250MB. Again, for a "fair" comparison, I will remain with compound.


[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507716 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------


> Just to clarify your comment on reusing field and doc instances - to my
> understanding reusing a field instance is ok *only* after the containing
> doc was added to the index.

Right, if your documents are very "regular" you should get a sizable
speedup (especially for tiny docs), with or without this patch, if you
make a single Document and add *separate* Field instances to it for
each field, and then reuse the Document and Field instances for all
the docs you want to add.

It's not easy to reuse Field instances now (there's no
setStringValue()).  I made a ReusableStringReader to do this but you
could also make your own class that implements Fieldable.
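A sketch of what such a resettable reader could look like (illustrative; this is not the ReusableStringReader from the patch, and the names are assumptions): the Field keeps one Reader instance whose String value is swapped per document, so no new Reader/String wrapper is allocated per doc.

```java
import java.io.IOException;
import java.io.Reader;

// Sketch of a resettable Reader: call setValue() before each document
// to reuse this one instance instead of allocating a new StringReader.
public class ResettableStringReader extends Reader {
    private String s;
    private int pos;

    public void setValue(String s) { this.s = s; this.pos = 0; }

    public int read(char[] cbuf, int off, int len) throws IOException {
        if (pos >= s.length()) return -1;          // value exhausted
        int n = Math.min(len, s.length() - pos);
        s.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    public void close() {}

    public static void main(String[] args) throws IOException {
        ResettableStringReader r = new ResettableStringReader();
        char[] buf = new char[16];
        r.setValue("doc one");
        int n = r.read(buf, 0, buf.length);
        System.out.println(new String(buf, 0, n));  // prints "doc one"
        r.setValue("doc two");                      // reuse: no new Reader
        n = r.read(buf, 0, buf.length);
        System.out.println(new String(buf, 0, n));  // prints "doc two"
    }
}
```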

> For a "fair" comparison I ended up not following most of your
> recommendations, including the reuse field/docs one and the non-compound
> one (apologies:-)), but I might use them later.

OK, when you say "fair" I think you mean because you already had a
previous run that used compound file, you had to use compound file in
the run with the LUCENE-843 patch (etc)?  The recommendations above
should speed up Lucene with or without my patch.

> For the first 100,000,000 docs (==speller words) the speed-up is quite
> amazing:
>     Orig:    Speller: added 100000000 words in 10912 seconds = 3 hours 1
>     minutes 52 seconds
>     New:   Speller: added 100000000 words in 58490 seconds = 16 hours 14
>     minutes 50 seconds
> This is 5.3 times faster !!!

Wow!  I think the speedup might be even more if both of your runs followed
the suggestions above.

> This btw was with maxBufDocs=100,000 (I forgot to set the MEM param).
> I stopped the run now, I don't expect to learn anything new by letting it
> continue.
>
> When trying with  MEM=512MB, it at first seemed faster, but then there
> were now and then local slow-downs, and eventually it became a bit slower
> than the previous run. I know these are not merges, so they are either
> flushes (RAM directed), or GC activity. I will perhaps run with GC debug
> flags and perhaps add a print at flush so to tell the culprit for these
> local slow-downs.

Hurm, odd.  I haven't pushed the RAM buffer up to 512 MB, so it could be
that GC cost somehow makes things worse ... curious.

> Other than that, I will perhaps try to index .GOV2 (25 Million HTML docs)
> with this patch. The way I indexed it before it took about 4 days -
> running in 4 threads, and creating 36 indexes. This is even more a real
> life scenario, it involves HTML parsing, standard analysis, and merging
> (to some extent). Since there are 4 threads each one will get, say,
> 250MB. Again, for a "fair" comparison, I will remain with compound.

OK, because you're doing StandardAnalyzer and HTML parsing and
presumably loading one doc per file, most of your time is spent
outside of Lucene indexing, so I'd expect less than a 50% speedup in
this case.
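That estimate is just the usual serial-fraction arithmetic. With assumed numbers (these are for illustration, not measurements from this thread): if only a fraction f of wall-clock time is Lucene indexing, and that part gets s times faster, the overall speedup is 1 / ((1 - f) + f / s).

```java
// Why time spent outside indexing caps the overall win:
// overall speedup = 1 / ((1 - f) + f / s), where f is the fraction of
// time spent in indexing and s is the indexing-only speedup.
public class OverallSpeedup {
    static double overall(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }
    public static void main(String[] args) {
        // Assume 40% of the time is indexing and indexing gets 4x faster:
        // overall = 1 / (0.6 + 0.1) = ~1.43x, i.e. under a 50% speedup.
        System.out.println(overall(0.40, 4.0));
    }
}
```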


Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Doron Cohen
Michael McCandless wrote:

> OK, when you say "fair" I think you mean because you already had a
> previous run that used compound file, you had to use compound file in
> the run with the LUCENE-843 patch (etc)?

Yes, that's true.

> The recommendations above should speed up Lucene with or without my
> patch.

Sure, makes sense.

> > When trying with MEM=512MB, it at first seemed faster, but then there
> > were now and then local slow-downs, and eventually it became a bit slower
> > than the previous run. I know these are not merges, so they are either
> > flushes (RAM directed), or GC activity. I will perhaps run with GC debug
> > flags and perhaps add a print at flush so to tell the culprit for these
> > local slow-downs.
>
> Hurm, odd.  I haven't pushed RAM buffer up to 512 MB so it could be GC
> cost somehow makes things worse ... curious.

OK, I tried this with -verbose:gc, and GC seems ok here -

Here is a log snippet for the RAM=512MB setting run:

---------- log start -----------------
Speller: added 3290000 words in 241 seconds = 4 minutes 1 seconds
Speller: added 3300000 words in 241 seconds = 4 minutes 1 seconds
Speller: added 3310000 words in 242 seconds = 4 minutes 2 seconds
Speller: added 3320000 words in 242 seconds = 4 minutes 2 seconds

  RAM: now flush @ usedMB=512.012 allocMB=512.012 triggerMB=512
  flush: flushDocs=true flushDeletes=false flushDocStores=true
numDocs=3328167

closeDocStore: 2 files to flush to segment _0

flush postings as segment _0 numDocs=3328167
<af type="tenured" id="40" timestamp="Mon Jun 25 22:50:18 2007"
intervalms="7451.237">
  <minimum requested_bytes="955824" />
  <time exclusiveaccessms="0.039" />
  <tenured freebytes="59197944" totalbytes="1572864000" percent="3" >
    <soa freebytes="59197944" totalbytes="1572864000" percent="3" />
    <loa freebytes="0" totalbytes="0" percent="0" />
  </tenured>
  <gc type="global" id="40" totalid="40" intervalms="7451.371">
    <compaction movecount="7497075" movebytes="1061942632" reason="compact
to meet allocation" />
    <refs_cleared soft="0" threshold="32" weak="0" phantom="0" />
    <finalization objectsqueued="1" />
    <timesms mark="2507.521" sweep="25.223" compact="3992.243"
total="6525.114" />
    <tenured freebytes="297819120" totalbytes="1572864000" percent="18" >
      <soa freebytes="297819120" totalbytes="1572864000" percent="18" />
      <loa freebytes="0" totalbytes="0" percent="0" />
    </tenured>
  </gc>
  <tenured freebytes="296863296" totalbytes="1572864000" percent="18" >
    <soa freebytes="296863296" totalbytes="1572864000" percent="18" />
    <loa freebytes="0" totalbytes="0" percent="0" />
  </tenured>
  <time totalms="6525.340" />
</af>

<af type="tenured" id="41" timestamp="Mon Jun 25 22:51:16 2007"
intervalms="51348.671">
  <minimum requested_bytes="24" />
  <time exclusiveaccessms="0.045" />
  <tenured freebytes="0" totalbytes="1572864000" percent="0" >
    <soa freebytes="0" totalbytes="1572864000" percent="0" />
    <loa freebytes="0" totalbytes="0" percent="0" />
  </tenured>
  <gc type="global" id="41" totalid="41" intervalms="51348.857">
    <refs_cleared soft="0" threshold="32" weak="0" phantom="0" />
    <finalization objectsqueued="0" />
    <timesms mark="1369.963" sweep="22.935" compact="0.000"
total="1392.988" />
    <tenured freebytes="294024904" totalbytes="1572864000" percent="18" >
      <soa freebytes="294024904" totalbytes="1572864000" percent="18" />
      <loa freebytes="0" totalbytes="0" percent="0" />
    </tenured>
  </gc>
  <tenured freebytes="294024392" totalbytes="1572864000" percent="18" >
    <soa freebytes="294024392" totalbytes="1572864000" percent="18" />
    <loa freebytes="0" totalbytes="0" percent="0" />
  </tenured>
  <time totalms="1393.264" />
</af>

  oldRAMSize=339488720 newFlushedSize=299712406 docs/MB=11,643.949
new/old=88.283%
org.apache.lucene.index.IndexFileDeleter@30463046 main: now checkpoint "segments_2" [1 segments ; isCommit = true]
org.apache.lucene.index.IndexFileDeleter@30463046 main:   IncRef "_0.fnm": pre-incr count is 0
org.apache.lucene.index.IndexFileDeleter@30463046 main:   IncRef "_0.frq": pre-incr count is 0
org.apache.lucene.index.IndexFileDeleter@30463046 main:   IncRef "_0.prx": pre-incr count is 0
org.apache.lucene.index.IndexFileDeleter@30463046 main:   IncRef "_0.tis": pre-incr count is 0
org.apache.lucene.index.IndexFileDeleter@30463046 main:   IncRef "_0.tii": pre-incr count is 0
org.apache.lucene.index.IndexFileDeleter@30463046 main:   IncRef "_0.nrm": pre-incr count is 0
org.apache.lucene.index.IndexFileDeleter@30463046 main:   IncRef "_0.fdx": pre-incr count is 0
org.apache.lucene.index.IndexFileDeleter@30463046 main:   IncRef "_0.fdt": pre-incr count is 0
org.apache.lucene.index.IndexFileDeleter@30463046 main: deleteCommits: now remove commit "segments_1"
org.apache.lucene.index.IndexFileDeleter@30463046 main:   DecRef "segments_1": pre-decr count is 1
org.apache.lucene.index.IndexFileDeleter@30463046 main: delete "segments_1"
org.apache.lucene.index.IndexFileDeleter@30463046 main: now checkpoint "segments_3" [1 segments ; isCommit = true]
org.apache.lucene.index.IndexFileDeleter@30463046 main:   IncRef "_0.cfs": pre-incr count is 0
org.apache.lucene.index.IndexFileDeleter@30463046 main: deleteCommits: now remove commit "segments_2"
org.apache.lucene.index.IndexFileDeleter@30463046 main:   DecRef "_0.fnm": pre-decr count is 1
org.apache.lucene.index.IndexFileDeleter@30463046 main: delete "_0.fnm"
org.apache.lucene.index.IndexFileDeleter@30463046 main:   DecRef "_0.frq": pre-decr count is 1
org.apache.lucene.index.IndexFileDeleter@30463046 main: delete "_0.frq"
org.apache.lucene.index.IndexFileDeleter@30463046 main:   DecRef "_0.prx": pre-decr count is 1
org.apache.lucene.index.IndexFileDeleter@30463046 main: delete "_0.prx"
org.apache.lucene.index.IndexFileDeleter@30463046 main:   DecRef "_0.tis": pre-decr count is 1
org.apache.lucene.index.IndexFileDeleter@30463046 main: delete "_0.tis"
org.apache.lucene.index.IndexFileDeleter@30463046 main:   DecRef "_0.tii": pre-decr count is 1
org.apache.lucene.index.IndexFileDeleter@30463046 main: delete "_0.tii"
org.apache.lucene.index.IndexFileDeleter@30463046 main:   DecRef "_0.nrm": pre-decr count is 1
org.apache.lucene.index.IndexFileDeleter@30463046 main: delete "_0.nrm"
org.apache.lucene.index.IndexFileDeleter@30463046 main:   DecRef "_0.fdx": pre-decr count is 1
org.apache.lucene.index.IndexFileDeleter@30463046 main: delete "_0.fdx"
org.apache.lucene.index.IndexFileDeleter@30463046 main:   DecRef "_0.fdt": pre-decr count is 1
org.apache.lucene.index.IndexFileDeleter@30463046 main: delete "_0.fdt"
org.apache.lucene.index.IndexFileDeleter@30463046 main:   DecRef "segments_2": pre-decr count is 1
org.apache.lucene.index.IndexFileDeleter@30463046 main: delete "segments_2"
Speller: added 3330000 words in 339 seconds = 5 minutes 39 seconds
Speller: added 3340000 words in 340 seconds = 5 minutes 40 seconds
Speller: added 3350000 words in 341 seconds = 5 minutes 41 seconds
---------- log end -----------------
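As a sanity check, the new/old percentage in the log above is just newFlushedSize / oldRAMSize; a quick computation with the values copied from the log (class name invented for illustration, not part of Lucene):

```java
public class FlushRatioCheck {
    public static void main(String[] args) {
        long oldRAMSize = 339488720L;      // bytes buffered in RAM before the flush (from the log)
        long newFlushedSize = 299712406L;  // bytes actually written at flush time (from the log)
        double newOverOld = 100.0 * newFlushedSize / oldRAMSize;
        System.out.printf("new/old=%.3f%%%n", newOverOld); // prints new/old=88.283%
    }
}
```

So the flushed segment takes roughly 88% of the RAM the buffered docs occupied, matching the "new/old=88.283%" line.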

So there is about a 100-second gap, of which 8 seconds are GC; the rest is
the flush. I am not saying this is a problem, just bringing the info. The
behavior along the run seems similar - that was the first flush, after
adding 3.3M docs (words). The next flush was after adding 6.5M docs, ~100
secs again, with similar GC/flush times. So I guess it makes sense that one
has to pay some time for flushing that large a number of added docs. It is
interesting that beyond a certain value there's no point in allowing more
RAM; the question is what the recommended value would be... I sort of
followed (all the way :-)) the "let it have as much memory as possible"
approach - I guess the best recommendation should be lower than that.
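The flush-by-RAM behavior being discussed can be sketched as a toy buffer that flushes when its estimated byte usage crosses a budget, instead of after a fixed document count. This is only an illustration of the idea behind setRAMBufferSize; the class, the per-doc size estimate, and the budget are all invented, not Lucene code:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: trigger a flush by estimated RAM usage, not by doc count.
public class RamTriggeredBuffer {
    private final long ramBudgetBytes;
    private final List<String> buffered = new ArrayList<>();
    private long bytesUsed = 0;
    private int flushCount = 0;

    public RamTriggeredBuffer(long ramBudgetBytes) {
        this.ramBudgetBytes = ramBudgetBytes;
    }

    public void addDocument(String doc) {
        buffered.add(doc);
        bytesUsed += 2L * doc.length(); // crude per-doc RAM estimate (2 bytes/char)
        if (bytesUsed >= ramBudgetBytes) {
            flush();
        }
    }

    private void flush() {
        // The real writer would write a segment here; the toy just drops the buffer.
        buffered.clear();
        bytesUsed = 0;
        flushCount++;
    }

    public int getFlushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        // With a tiny 1 KB budget, many small "docs" force periodic flushes.
        RamTriggeredBuffer buf = new RamTriggeredBuffer(1024);
        for (int i = 0; i < 1000; i++) {
            buf.addDocument("word" + i);
        }
        System.out.println("flushes=" + buf.getFlushCount());
    }
}
```

The trade-off above is visible in the numbers quoted earlier: a larger budget means fewer, bigger flushes, but each flush pause grows with the amount buffered.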

>
> > Other than that, I will perhaps try to index .GOV2 (25
> Million HTML docs)
> > with this patch. The way I indexed it before it took about 4 days -
> > running in 4 threads, and creating 36 indexes. This is even more a real
> > life scenario, it involves HTML parsing, standard analysis, and merging
> > (to some extent). Since there are 4 threads each one will get, say,
> > 250MB. Again, for a "fair" comparison, I will remain with compound.
>
> OK, because you're doing StandardAnalyzer and HTML parsing and
> presumably loading one-doc-per-file, most of your time is spent
> outside of Lucene indexing, so I'd expect less than a 50% speedup in
> this case.

The ~25M docs are in ~27K zip files, so the IO side should be brighter...


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Resolved: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-843.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.3
    Lucene Fields: [New, Patch Available]  (was: [New])

> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: index.presharedstores.cfs.zip, index.presharedstores.nocfs.zip, LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch, LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch, LUCENE-843.take9.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number documents added.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Reopened: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reopened LUCENE-843:
---------------------------------------

    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Re-opening this issue: I saw one failure of the contrib/benchmark
TestPerfTasksLogic.testParallelDocMaker() tests due to an intermittent
thread-safety issue.  It's hard to get the failure to happen (it has
happened only once in ~20 runs of contrib/benchmark), but I see where
the issue is.  Will commit a fix shortly.



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512264 ]

Steven Parkes commented on LUCENE-843:
--------------------------------------

Did we lose the triggered merge stuff from 887, i.e., should it be

        if (triggerMerge) {
          /* new merge policy
          if (0 == docWriter.getMaxBufferedDocs())
            maybeMergeSegments(mergeFactor * numDocs / 2);
          else
            maybeMergeSegments(docWriter.getMaxBufferedDocs());
          */
          maybeMergeSegments(docWriter.getMaxBufferedDocs());
        }
 



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

ASF GitHub Bot (Jira)
In reply to this post by ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512275 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

Woops ... you are right; thanks for catching it!  I will add a unit
test & fix it.  I will also make the flush(boolean triggerMerge,
boolean flushDocStores) protected, not public, and move the javadoc
back to the public flush().


