[jira] Created: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Grant Ingersoll-2
I've only been loosely following this...

Do you think it is possible to separate the stored/term vector  
handling into a separate patch against the current trunk?  This seems  
like a quick win and I know it has been speculated about before.

On Mar 23, 2007, at 12:00 PM, Michael McCandless wrote:

>
> "Yonik Seeley" <[hidden email]> wrote:
>> On 3/22/07, Michael McCandless <[hidden email]> wrote:
>>> Merging is costly because you read all data in, then write all data
>>> out, so you want to minimize, for each byte of data in the index, how
>>> many times it will be "serviced" (read in, written out) as
>>> part of a merge.
>>
>> Avoiding the re-writing of stored fields might be nice:
>> http://www.nabble.com/Re%3A--jira--Commented%3A-%28LUCENE-565%29-Supporting-deleteDocuments-in-IndexWriter-%28Code-and-Performance-Results-Provided%29-p6177280.html
>
> That's exactly the approach I'm taking in LUCENE-843: stored fields  
> and term
> vectors are immediately written to disk.  Only frq, prx and tis use up
> memory.  This greatly extends how many docs you can buffer before
> having to flush (assuming your docs have stored fields and term
> vectors).
>
> When memory is full, I either flush a segment to disk (when the writer
> is in autoCommit=true mode) or flush the data to tmp files that are
> finally merged into a segment when the writer is closed.  This merging
> is less costly because the bytes in/out are just frq, prx and tis, so
> autoCommit=false mode performs better than autoCommit=true mode.
>
> But, this is only for the segment created from buffered docs (ie the
> segment created by a "flush").  Subsequent merges still must copy
> bytes in/out and in LUCENE-843 I haven't changed anything about how
> segments are merged.
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
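The flush-when-RAM-is-full behavior Mike describes can be sketched as a toy policy. This is a hedged illustration under assumed names (RamFlushPolicy is not a class in the patch); only postings RAM counts toward the threshold, since stored fields and term vectors go straight to disk.

```java
// Toy sketch of flushing by RAM usage rather than by document count.
// Names and accounting are illustrative assumptions, not patch code.
class RamFlushPolicy {
    private final long ramBufferBytes;  // flush threshold
    private long bytesUsed;             // RAM held by buffered postings
    private int flushes;

    RamFlushPolicy(long ramBufferBytes) {
        this.ramBufferBytes = ramBufferBytes;
    }

    void addDocument(long approxPostingsBytes) {
        // Stored fields/term vectors are written straight to disk,
        // so only the postings (frq, prx, tis) count here.
        bytesUsed += approxPostingsBytes;
        if (bytesUsed >= ramBufferBytes) {
            flushes++;      // write buffered postings out as a segment
            bytesUsed = 0;  // buffers are recycled, not reallocated
        }
    }

    int flushCount() {
        return flushes;
    }
}
```

The more docs that fit under the threshold before each flush, the larger the initial segments and the less merging needed later.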

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ





Re: [jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless-2

"Grant Ingersoll" <[hidden email]> wrote:
> I've only been loosely following this...
>
> Do you think it is possible to separate the stored/term vector  
> handling into a separate patch against the current trunk?  This seems  
> like a quick win and I know it has been speculated about before.

This is definitely possible, but I'd rather just do it as part of
LUCENE-843 (I don't think I'm too far from iterating it to a good
point).

Mike



[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Kenneth William Krugler (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take2.patch

New rev of the patch:

  * Fixed at least one data corruption case

  * Added more asserts (run with "java -ea" so asserts run)

  * Some more small optimizations

  * Updated to current trunk so patch applies cleanly



> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.




[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Kenneth William Krugler (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take3.patch


Another rev of the patch:

  * Got thread concurrency working: removed "synchronized" from the
    entire call to MultiDocWriter.addDocument and instead synchronize
    two quick steps of addDocument (init/finish), leaving the real work
    (processDocument) unsynchronized.

  * Fixed a bug that failed to delete temp files from the index

  * Reduced memory usage of Posting by inlining positions, start
    offsets, and end offsets into a single int array.

  * Enabled IndexLineFiles.java (tool I use for local benchmarking) to
    run multiple threads

  * Other small optimizations

BTW, one of the nice side effects of this patch is that it cleans up
the mergeSegments method of IndexWriter by separating the "flush" of
added docs & deletions (which is no longer a merge) from the "true"
mergeSegments, whose purpose is then to merge disk segments.
Previously mergeSegments was getting rather confusing with the
different cases/combinations: added docs or not, deleted docs or not,
any merges or not.
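The locking pattern described above, synchronizing only the quick init/finish bookkeeping while the expensive per-document work runs unlocked, can be sketched roughly like this (a hypothetical illustration with assumed names; the real MultiDocWriter differs):

```java
// Hypothetical sketch of the synchronization change: only the quick
// bookkeeping steps hold the lock, so multiple threads can run the
// expensive processDocument step concurrently.
class MultiDocWriterSketch {
    private int nextDocID;

    private synchronized int initDocument() {
        return nextDocID++;   // quick: assign a doc id under the lock
    }

    private void processDocument(int docID, String text) {
        // expensive: tokenize and invert the document; no lock held here
    }

    private synchronized void finishDocument(int docID) {
        // quick: publish the processed doc's buffers in docID order
    }

    public void addDocument(String text) {
        int docID = initDocument();
        processDocument(docID, text);
        finishDocument(docID);
    }

    public synchronized int docCount() {
        return nextDocID;
    }
}
```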





[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Kenneth William Krugler (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take4.patch

Another rev of the patch.  All tests pass except disk full tests.  The
code is still rather "dirty" and not well commented.

I think I'm close to finishing the optimizing, and now I will focus on
error handling (eg disk full), adding some deeper unit tests, more
testing of corner cases like massive docs or docs with massive terms,
flushing pending norms to disk, cleaning up / commenting the code, and
various other smaller items.

Here are the changes in this rev:

  * A proposed backwards compatible change to the Token API to also
    allow the term text to be delivered as a slice (offset & length)
    into a char[] array instead of a String.  With an analyzer/tokenizer
    that takes advantage of this, I saw a decent performance gain
    in my local testing.  I've created a SimpleSpaceAnalyzer that only
    splits words at the space character to test this.

  * Added more asserts (run java -ea to enable asserts).  The asserts
    are quite useful and now often catch a bug I've introduced before
    the unit tests do.

  * Changed to custom int[] block buffering for postings to store
    freq, prox's and offsets.  With this buffering we no longer have
    to double the size of int[] arrays while adding positions, nor do
    we have to copy ints whenever we need more space for these
    arrays.  Instead I allocate larger slices out of the shared int[]
    arrays.  This reduces memory and improves performance.

  * Changed to custom char[] block buffering for postings to store
    term text.  This also reduces memory and improves performance.

  * Changed to a single file for RAM & flushed partial segments (was
    3 separate files before)

  * Changed how I merge flushed partial segments to match what's
    described in LUCENE-854

  * Reduced memory usage when indexing large docs (25 MB plain text
    each).  I'm still consuming more RAM in this case than the
    baseline (trunk) so I'm still working on this one ...

  * Fixed a slow memory leak when building large (20+ GB) indices
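The char[]-slice idea in the first bullet above can be sketched like this. This is a hedged illustration (SliceToken and its fields are my own names, not the proposed Token API): the term text lives as an offset & length into a shared buffer, and a String is only materialized on demand.

```java
// Illustrative sketch of delivering term text as a slice into a shared
// char[] rather than allocating a new String per token. Names are
// assumptions, not the actual Token API change.
class SliceToken {
    final char[] buffer;  // shared buffer owned by the tokenizer
    final int start;      // offset of this term's first char
    final int length;     // number of chars in the term

    SliceToken(char[] buffer, int start, int length) {
        this.buffer = buffer;
        this.start = start;
        this.length = length;
    }

    // Only materialize a String when a consumer really needs one.
    String termText() {
        return new String(buffer, start, length);
    }
}
```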





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Kenneth William Krugler (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486292 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

Some details on how I measure RAM usage: both the baseline (current
lucene trunk) and my patch have two general classes of RAM usage.

The first class, "document processing RAM", is RAM used while
processing a single doc. This RAM is re-used for each document (in the
trunk, it's GC'd and new RAM is allocated; in my patch, I explicitly
re-use these objects) and how large it gets is driven by how big each
document is.

The second class, "indexed documents RAM", is the RAM used up by
previously indexed documents.  This RAM grows with each added
document and how large it gets is driven by the number and size of
docs indexed since the last flush.

So when I say the writer is allowed to use 32 MB of RAM, I'm only
measuring the "indexed documents RAM".  With trunk I do this by
calling ramSizeInBytes(), and with my patch I do the analogous thing
by measuring how many RAM buffers are held up storing previously
indexed documents.

I then define "RAM efficiency" (docs/MB) as how many docs we can hold
in "indexed documents RAM" per MB RAM, at the point that we flush to
disk.  I think this is an important metric because it drives how large
your initial (level 0) segments are.  The larger these segments are
then generally the less merging you need to do, for a given # docs in
the index.

I also measure overall RAM used in the JVM (using
MemoryMXBean.getHeapMemoryUsage().getUsed()) just prior to each flush
except the last, to also capture the "document processing RAM", object
overhead, etc.
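The heap sampling described above can be reproduced with the standard management API. A minimal sketch (the helper class and method are my own; only the MemoryMXBean call chain matches what the comment describes):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Sample overall JVM heap usage, e.g. just prior to each flush, to
// capture "document processing RAM" and object overhead on top of the
// writer's own "indexed documents RAM" accounting.
class HeapSample {
    static long heapUsedBytes() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        return bean.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        System.out.printf("heap used = %.1f MB%n",
                heapUsedBytes() / (1024.0 * 1024.0));
    }
}
```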




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Kenneth William Krugler (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486293 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

To do the benchmarking I created a simple standalone tool
(demo/IndexLineFiles, in the last patch) that indexes one line at a
time from a large previously created file, optionally using multiple
threads.  I do it this way to minimize IO cost of pulling the document
source because I want to measure just indexing time as much as possible.

Each line is read and a doc is created with field "contents" that is
not stored, is tokenized, and optionally has term vectors with
position+offsets.  I also optionally add two small only-stored fields
("path" and "modified").  I think these are fairly trivial documents
compared to typical usage of Lucene.

For the corpus, I took Europarl's "en" content, stripped tags, and
processed it into 3 files of plain text: one with 100 tokens per line
(= ~550 bytes), one with 1000 tokens per line (= ~5,500 bytes), and
one with 10000 tokens per line (= ~55,000 bytes).

All settings (mergeFactor, compound file, etc.) are left at defaults.
I don't optimize the index in the end.  I'm using my new
SimpleSpaceAnalyzer (it just splits tokens on the space character and
creates the token text as a slice into a char[] array instead of a new
String(...)) to minimize the cost of tokenization.

I ran the tests with Java 1.5 on a Mac Pro quad (2 Intel CPUs, each
dual core) OS X box with 2 GB RAM.  I give java 1 GB heap (-Xmx1024m).
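A space-only split like the SimpleSpaceAnalyzer described above might look roughly like this (my own sketch under assumed names, not the patch's analyzer): scan the char[] once and report each token as an (offset, length) pair into the shared buffer, never allocating a String per token.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a space-only tokenizer that emits (offset, length) slices
// into the shared buffer instead of new Strings. Illustrative only.
class SpaceSplitter {
    // Each returned int[] holds { offset, length } of one token.
    static List<int[]> split(char[] buffer) {
        List<int[]> tokens = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= buffer.length; i++) {
            boolean isSpace = (i == buffer.length) || buffer[i] == ' ';
            if (!isSpace && start < 0) {
                start = i;                                  // token begins
            } else if (isSpace && start >= 0) {
                tokens.add(new int[] { start, i - start }); // token ends
                start = -1;
            }
        }
        return tokens;
    }
}
```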




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Kenneth William Krugler (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486332 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

A couple more details on the testing: I run java -server to get all
optimizations in the JVM, and the IO system is a local OS X RAID 0 of
4 SATA drives.

Using the above tool I ran an initial set of benchmarks comparing old
(= Lucene trunk) vs new (= this patch), varying document size (~550
bytes to ~5,500 bytes to ~55,000 bytes of plain text from Europarl
"en").

For each document size I run 4 combinations of whether term vectors
and stored fields are on or off and whether autoCommit is true or
false.  I measure net docs/sec (= total # docs indexed divided by
total time taken), RAM efficiency (= avg # docs flushed with each
flush divided by RAM buffer size), and avg HEAP RAM usage before each
flush.
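As a sanity check, the percentages in these tables follow directly from the raw numbers. A tiny illustration using the first 10K-token result (20000 docs in 126.0 secs vs the old 99.8 docs/sec); the class and method names are my own:

```java
// Recompute net docs/sec and the relative speedup from the raw figures.
class Metrics {
    static double docsPerSec(int docs, double secs) {
        return docs / secs;
    }

    static double pctFaster(double oldRate, double newRate) {
        return 100.0 * (newRate - oldRate) / oldRate;
    }

    public static void main(String[] args) {
        // 20000 docs / 126.0 secs -> ~158.7 docs/sec
        System.out.printf("new: %.1f docs/sec%n", docsPerSec(20000, 126.0));
        // vs old 99.8 docs/sec -> ~59.0% faster
        System.out.printf("%.1f%% faster%n", pctFaster(99.8, 158.7));
    }
}
```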

Here are the results for the 10K tokens (= ~55,000 bytes plain text)
per document:

  20000 DOCS @ ~55,000 bytes plain text
  RAM = 32 MB
  NUM THREADS = 1
  MERGE FACTOR = 10


    No term vectors nor stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          20000 docs in 200.3 secs
          index size = 358M

        new
          20000 docs in 126.0 secs
          index size = 356M

        Total Docs/sec:             old    99.8; new   158.7 [   59.0% faster]
        Docs/MB @ flush:            old    24.2; new    49.1 [  102.5% more]
        Avg RAM used (MB) @ flush:  old    74.5; new    36.2 [   51.4% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          20000 docs in 202.7 secs
          index size = 358M

        new
          20000 docs in 120.0 secs
          index size = 354M

        Total Docs/sec:             old    98.7; new   166.7 [   69.0% faster]
        Docs/MB @ flush:            old    24.2; new    48.9 [  101.7% more]
        Avg RAM used (MB) @ flush:  old    74.3; new    37.0 [   50.2% less]



    With term vectors (positions + offsets) and 2 small stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          20000 docs in 374.7 secs
          index size = 1.4G

        new
          20000 docs in 236.1 secs
          index size = 1.4G

        Total Docs/sec:             old    53.4; new    84.7 [   58.7% faster]
        Docs/MB @ flush:            old    10.2; new    49.1 [  382.8% more]
        Avg RAM used (MB) @ flush:  old   129.3; new    36.6 [   71.7% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          20000 docs in 385.7 secs
          index size = 1.4G

        new
          20000 docs in 182.8 secs
          index size = 1.4G

        Total Docs/sec:             old    51.9; new   109.4 [  111.0% faster]
        Docs/MB @ flush:            old    10.2; new    48.9 [  380.9% more]
        Avg RAM used (MB) @ flush:  old    76.0; new    37.3 [   50.9% less]





[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Kenneth William Krugler (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486334 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

Here are the results for "normal" sized docs (1K tokens = ~5,500 bytes plain text each):

  200000 DOCS @ ~5,500 bytes plain text
  RAM = 32 MB
  NUM THREADS = 1
  MERGE FACTOR = 10


    No term vectors nor stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          200000 docs in 397.6 secs
          index size = 415M

        new
          200000 docs in 167.5 secs
          index size = 411M

        Total Docs/sec:             old   503.1; new  1194.1 [  137.3% faster]
        Docs/MB @ flush:            old    81.6; new   406.2 [  397.6% more]
        Avg RAM used (MB) @ flush:  old    87.3; new    35.2 [   59.7% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          200000 docs in 394.6 secs
          index size = 415M

        new
          200000 docs in 168.4 secs
          index size = 408M

        Total Docs/sec:             old   506.9; new  1187.7 [  134.3% faster]
        Docs/MB @ flush:            old    81.6; new   432.2 [  429.4% more]
        Avg RAM used (MB) @ flush:  old   126.6; new    36.9 [   70.8% less]



    With term vectors (positions + offsets) and 2 small stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          200000 docs in 754.2 secs
          index size = 1.7G

        new
          200000 docs in 304.9 secs
          index size = 1.7G

        Total Docs/sec:             old   265.2; new   656.0 [  147.4% faster]
        Docs/MB @ flush:            old    46.7; new   406.2 [  769.6% more]
        Avg RAM used (MB) @ flush:  old    92.9; new    35.2 [   62.1% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          200000 docs in 743.9 secs
          index size = 1.7G

        new
          200000 docs in 244.3 secs
          index size = 1.7G

        Total Docs/sec:             old   268.9; new   818.7 [  204.5% faster]
        Docs/MB @ flush:            old    46.7; new   432.2 [  825.2% more]
        Avg RAM used (MB) @ flush:  old    93.0; new    36.6 [   60.6% less]







[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Kenneth William Krugler (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486335 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------


Last are the results for small docs (100 tokens = ~550 bytes plain text each):

  2000000 DOCS @ ~550 bytes plain text
  RAM = 32 MB
  NUM THREADS = 1
  MERGE FACTOR = 10


    No term vectors nor stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          2000000 docs in 886.7 secs
          index size = 438M

        new
          2000000 docs in 230.5 secs
          index size = 435M

        Total Docs/sec:             old  2255.6; new  8676.4 [  284.7% faster]
        Docs/MB @ flush:            old   128.0; new  4194.6 [ 3176.2% more]
        Avg RAM used (MB) @ flush:  old   107.3; new    37.7 [   64.9% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          2000000 docs in 888.7 secs
          index size = 438M

        new
          2000000 docs in 239.6 secs
          index size = 432M

        Total Docs/sec:             old  2250.5; new  8348.7 [  271.0% faster]
        Docs/MB @ flush:            old   128.0; new  4146.8 [ 3138.9% more]
        Avg RAM used (MB) @ flush:  old   108.1; new    38.9 [   64.0% less]



    With term vectors (positions + offsets) and 2 small stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          2000000 docs in 1480.1 secs
          index size = 2.1G

        new
          2000000 docs in 462.0 secs
          index size = 2.1G

        Total Docs/sec:             old  1351.2; new  4329.3 [  220.4% faster]
        Docs/MB @ flush:            old    93.1; new  4194.6 [ 4405.7% more]
        Avg RAM used (MB) @ flush:  old   296.4; new    38.3 [   87.1% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          2000000 docs in 1489.4 secs
          index size = 2.1G

        new
          2000000 docs in 347.9 secs
          index size = 2.1G

        Total Docs/sec:             old  1342.8; new  5749.4 [  328.2% faster]
        Docs/MB @ flush:            old    93.1; new  4146.8 [ 4354.5% more]
        Avg RAM used (MB) @ flush:  old   297.1; new    38.6 [   87.0% less]



  200000 DOCS @ ~5,500 bytes plain text


    No term vectors nor stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          200000 docs in 397.6 secs
          index size = 415M

        new
          200000 docs in 167.5 secs
          index size = 411M

        Total Docs/sec:             old   503.1; new  1194.1 [  137.3% faster]
        Docs/MB @ flush:            old    81.6; new   406.2 [  397.6% more]
        Avg RAM used (MB) @ flush:  old    87.3; new    35.2 [   59.7% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          200000 docs in 394.6 secs
          index size = 415M

        new
          200000 docs in 168.4 secs
          index size = 408M

        Total Docs/sec:             old   506.9; new  1187.7 [  134.3% faster]
        Docs/MB @ flush:            old    81.6; new   432.2 [  429.4% more]
        Avg RAM used (MB) @ flush:  old   126.6; new    36.9 [   70.8% less]



    With term vectors (positions + offsets) and 2 small stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          200000 docs in 754.2 secs
          index size = 1.7G

        new
          200000 docs in 304.9 secs
          index size = 1.7G

        Total Docs/sec:             old   265.2; new   656.0 [  147.4% faster]
        Docs/MB @ flush:            old    46.7; new   406.2 [  769.6% more]
        Avg RAM used (MB) @ flush:  old    92.9; new    35.2 [   62.1% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          200000 docs in 743.9 secs
          index size = 1.7G

        new
          200000 docs in 244.3 secs
          index size = 1.7G

        Total Docs/sec:             old   268.9; new   818.7 [  204.5% faster]
        Docs/MB @ flush:            old    46.7; new   432.2 [  825.2% more]
        Avg RAM used (MB) @ flush:  old    93.0; new    36.6 [   60.6% less]
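The derived columns in these tables follow directly from the raw measurements; for example, going from 2255.6 to 8676.4 docs/sec is reported as 284.7% faster. A quick sketch of the arithmetic (illustration only):

```java
// How the "% faster" and "% less" columns are computed from the raw
// docs/sec and RAM figures reported in the tables above.
class SpeedupMath {
    static double percentFaster(double oldRate, double newRate) {
        return (newRate / oldRate - 1.0) * 100.0;
    }

    static double percentLess(double oldValue, double newValue) {
        return (1.0 - newValue / oldValue) * 100.0;
    }
}
```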






[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Kenneth William Krugler (Jira)
In reply to this post by Kenneth William Krugler (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486339 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

A few notes from these results:

  * A real Lucene app won't see gains this large because, typically,
    retrieving docs from the content source and tokenizing them take
    substantial amounts of time.  For this test I intentionally
    minimized the cost of those steps: I'm 1) pulling one line at a
    time from a big text file, and 2) using my simplistic
    SimpleSpaceAnalyzer, which just breaks tokens at the space
    character.

  * Best speedup is ~4.3X faster, for tiny docs (~550 bytes) with term
    vectors and stored fields enabled and using autoCommit=false.

  * Least speedup is still ~1.6X faster, for large docs (~55,000
    bytes) with autoCommit=true.

  * The autoCommit=false cases are a little unfair to the new patch
    because with it you get a single-segment (optimized) index in the
    end, but with the existing Lucene trunk you don't.

  * With term vectors and/or stored fields, autoCommit=false is quite
    a bit faster with the patch, because we never pay the price to
    merge them since they are written once.

  * With term vectors and/or stored fields, the new patch has
    substantially better RAM efficiency.

  * The patch is especially faster and has better RAM efficiency with
    smaller documents.

  * The actual HEAP RAM usage is quite a bit more stable with the
    patch, especially with term vectors & stored fields enabled.  I
    think this is because the patch creates far less garbage for GC to
    periodically reclaim.  I think this also means you could push your
    RAM buffer size even higher to get better performance.



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Kenneth William Krugler (Jira)
In reply to this post by Kenneth William Krugler (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486373 ]

Marvin Humphrey commented on LUCENE-843:
----------------------------------------

> The actual HEAP RAM usage is quite a bit more
> stable with the  patch, especially with term vectors
> & stored fields enabled. I think this is because the
> patch creates far less garbage for GC to periodically
> reclaim. I think this also means you could push your
> RAM buffer size even higher to get better performance.

For KinoSearch, the sweet spot seems to be a buffer of around 16 MB when benchmarking with the Reuters corpus on my G4 laptop. Larger than that and things actually slow down, unless the buffer is large enough that it never needs flushing. My hypothesis is that RAM fragmentation is slowing down malloc/free.  I'll be interested to see whether you observe the same effect.


Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Ning Li-3
In reply to this post by Kenneth William Krugler (Jira)
On 4/3/07, Michael McCandless (JIRA) <[hidden email]> wrote:

>  * With term vectors and/or stored fields, the new patch has
>    substantially better RAM efficiency.

Impressive numbers! The new patch improves RAM efficiency quite a bit
even with no term vectors nor stored fields, because of the periodic
in-RAM merges of posting lists & term infos etc. The frequency of the
in-RAM merges is controlled by flushedMergeFactor, which measures in
doc count, right? How sensitive is performance to the value of
flushedMergeFactor?

Cheers,
Ning


[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Kenneth William Krugler (Jira)
In reply to this post by Kenneth William Krugler (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486385 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------


>> The actual HEAP RAM usage is quite a bit more
>> stable with the patch, especially with term vectors
>> & stored fields enabled. I think this is because the
>> patch creates far less garbage for GC to periodically
>> reclaim. I think this also means you could push your
>> RAM buffer size even higher to get better performance.
>
> For KinoSearch, the sweet spot seems to be a buffer of around 16 MB
> when benchmarking with the Reuters corpus on my G4 laptop. Larger
> than that and things actually slow down, unless the buffer is large
> enough that it never needs flushing. My hypothesis is that RAM
> fragmentation is slowing down malloc/free. I'll be interested as to
> whether you see the same effect.

Interesting.  OK I will run the benchmark across increasing RAM sizes
to see where the sweet spot seems to be!



Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless-2
In reply to this post by Ning Li-3

"Ning Li" <[hidden email]> wrote:

> On 4/3/07, Michael McCandless (JIRA) <[hidden email]> wrote:
>
> >  * With term vectors and/or stored fields, the new patch has
> >    substantially better RAM efficiency.
>
> Impressive numbers! The new patch improves RAM efficiency quite a bit
> even with no term vectors nor stored fields, because of the periodic
> in-RAM merges of posting lists & term infos etc. The frequency of the
> in-RAM merges is controlled by flushedMergeFactor, which measures in
> doc count, right? How sensitive is performance to the value of
> flushedMergeFactor?

Right, the in-RAM merges seem to help *a lot* because you get great
compression of the terms dictionary, and also some compression of the
freq postings since the docIDs are delta encoded.  You also waste less
trailing buffer space (buffers are fixed-size) when you merge small
segments together into one large segment.
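The docID delta encoding mentioned here can be illustrated with a small sketch (simplified: Lucene actually interleaves these gaps with frequency data and writes them as variable-length ints, but the compression win comes from the gaps being small numbers):

```java
// Simplified sketch of delta-encoding a sorted docID list, as in
// Lucene's freq postings: storing gaps instead of absolute IDs keeps
// the numbers small, so variable-length encodings use fewer bytes.
import java.util.ArrayList;
import java.util.List;

class DocIdDeltas {
    static List<Integer> encode(List<Integer> sortedDocIds) {
        List<Integer> deltas = new ArrayList<>();
        int prev = 0;
        for (int id : sortedDocIds) {
            deltas.add(id - prev);  // gap from the previous docID
            prev = id;
        }
        return deltas;
    }

    static List<Integer> decode(List<Integer> deltas) {
        List<Integer> ids = new ArrayList<>();
        int cur = 0;
        for (int d : deltas) {
            cur += d;  // running sum recovers the absolute docID
            ids.add(cur);
        }
        return ids;
    }
}
```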

The in-RAM merges are triggered by number of bytes used vs RAM buffer
size.  Each doc is indexed to its own RAM segment, then once these
level 0 segments use > 1/Nth of the RAM buffer size, I merge into
level 1.  Then once level 1 segments are using > 1/Mth of the RAM
buffer size, I merge into level 2.  I don't do any merges beyond that.
Right now N = 14 and M = 7 but I haven't really tuned them yet ...

Once RAM is full, all of those segments are merged into a single
on-disk segment.  Once enough on-disk segments accumulate, they too
are periodically merged (based on flushedMergeFactor).  Finally, when
it's time to commit a real segment, I merge all RAM segments and
flushed segments into a real Lucene segment.
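The level scheme above can be modeled as a simple trigger function. This is a toy sketch only: the 1/N and 1/M thresholds and the untuned N=14, M=7 values come straight from the description, but the structure and names are invented, not the patch's code:

```java
// Toy model of the tiered in-RAM merge triggers: level-0 (per-doc)
// segments are merged into level 1 once they occupy more than 1/N of
// the RAM buffer, and level-1 segments into level 2 at 1/M of it.
class RamMergeLevels {
    static final int N = 14, M = 7;  // untuned values from the thread

    /** Returns the level to merge into, or -1 if no merge is due. */
    static int mergeDue(long level0Bytes, long level1Bytes,
                        long ramBufferBytes) {
        if (level1Bytes > ramBufferBytes / M) return 2;
        if (level0Bytes > ramBufferBytes / N) return 1;
        return -1;  // no merge beyond level 2 until RAM is full
    }
}
```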

I haven't done much testing to find the sweet spot for these merge
settings just yet.  Still plenty to do!

Mike


[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Kenneth William Krugler (Jira)
In reply to this post by Kenneth William Krugler (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486942 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------


OK I ran old (trunk) vs new (this patch) with increasing RAM buffer
sizes up to 96 MB.

I used the "normal" sized docs (~5,500 bytes plain text), left stored
fields and term vectors (positions + offsets) on, and
autoCommit=false.

Here're the results:

NUM THREADS = 1
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)


1 MB

  old
    200000 docs in 862.2 secs
    index size = 1.7G

  new
    200000 docs in 297.1 secs
    index size = 1.7G

  Total Docs/sec:             old   232.0; new   673.2 [  190.2% faster]
  Docs/MB @ flush:            old    47.2; new   278.4 [  489.6% more]
  Avg RAM used (MB) @ flush:  old    34.5; new     3.4 [   90.1% less]



2 MB

  old
    200000 docs in 828.7 secs
    index size = 1.7G

  new
    200000 docs in 279.0 secs
    index size = 1.7G

  Total Docs/sec:             old   241.3; new   716.8 [  197.0% faster]
  Docs/MB @ flush:            old    47.0; new   322.4 [  586.7% more]
  Avg RAM used (MB) @ flush:  old    37.9; new     4.5 [   88.0% less]



4 MB

  old
    200000 docs in 840.5 secs
    index size = 1.7G

  new
    200000 docs in 260.8 secs
    index size = 1.7G

  Total Docs/sec:             old   237.9; new   767.0 [  222.3% faster]
  Docs/MB @ flush:            old    46.8; new   363.1 [  675.4% more]
  Avg RAM used (MB) @ flush:  old    33.9; new     6.5 [   80.9% less]



8 MB

  old
    200000 docs in 678.8 secs
    index size = 1.7G

  new
    200000 docs in 248.8 secs
    index size = 1.7G

  Total Docs/sec:             old   294.6; new   803.7 [  172.8% faster]
  Docs/MB @ flush:            old    46.8; new   392.4 [  739.1% more]
  Avg RAM used (MB) @ flush:  old    60.3; new    10.7 [   82.2% less]



16 MB

  old
    200000 docs in 660.6 secs
    index size = 1.7G

  new
    200000 docs in 247.3 secs
    index size = 1.7G

  Total Docs/sec:             old   302.8; new   808.7 [  167.1% faster]
  Docs/MB @ flush:            old    46.7; new   415.4 [  788.8% more]
  Avg RAM used (MB) @ flush:  old    47.1; new    19.2 [   59.3% less]



24 MB

  old
    200000 docs in 658.1 secs
    index size = 1.7G

  new
    200000 docs in 243.0 secs
    index size = 1.7G

  Total Docs/sec:             old   303.9; new   823.0 [  170.8% faster]
  Docs/MB @ flush:            old    46.7; new   430.9 [  822.2% more]
  Avg RAM used (MB) @ flush:  old    70.0; new    27.5 [   60.8% less]



32 MB

  old
    200000 docs in 714.2 secs
    index size = 1.7G

  new
    200000 docs in 239.2 secs
    index size = 1.7G

  Total Docs/sec:             old   280.0; new   836.0 [  198.5% faster]
  Docs/MB @ flush:            old    46.7; new   432.2 [  825.2% more]
  Avg RAM used (MB) @ flush:  old    92.5; new    36.7 [   60.3% less]



48 MB

  old
    200000 docs in 640.3 secs
    index size = 1.7G

  new
    200000 docs in 236.0 secs
    index size = 1.7G

  Total Docs/sec:             old   312.4; new   847.5 [  171.3% faster]
  Docs/MB @ flush:            old    46.7; new   438.5 [  838.8% more]
  Avg RAM used (MB) @ flush:  old   138.9; new    52.8 [   62.0% less]



64 MB

  old
    200000 docs in 649.3 secs
    index size = 1.7G

  new
    200000 docs in 238.3 secs
    index size = 1.7G

  Total Docs/sec:             old   308.0; new   839.3 [  172.5% faster]
  Docs/MB @ flush:            old    46.7; new   441.3 [  844.7% more]
  Avg RAM used (MB) @ flush:  old   302.6; new    72.7 [   76.0% less]



80 MB

  old
    200000 docs in 670.2 secs
    index size = 1.7G

  new
    200000 docs in 227.2 secs
    index size = 1.7G

  Total Docs/sec:             old   298.4; new   880.5 [  195.0% faster]
  Docs/MB @ flush:            old    46.7; new   446.2 [  855.2% more]
  Avg RAM used (MB) @ flush:  old   231.7; new    94.3 [   59.3% less]



96 MB

  old
    200000 docs in 683.4 secs
    index size = 1.7G

  new
    200000 docs in 226.8 secs
    index size = 1.7G

  Total Docs/sec:             old   292.7; new   882.0 [  201.4% faster]
  Docs/MB @ flush:            old    46.7; new   448.0 [  859.1% more]
  Avg RAM used (MB) @ flush:  old   274.5; new   112.7 [   59.0% less]


Some observations:

  * Remember the test is already biased against "new" because with the
    patch you get an optimized index in the end but with "old" you
    don't.

  * Sweet spot for old (trunk) seems to be 48 MB: that is the peak
    docs/sec @ 312.4.

  * New (with patch) seems to just get faster the more memory you give
    it, though gradually.  The peak was 96 MB (the largest I ran), so
    no sweet spot yet (maybe I need to give it more memory, but above
    96 MB the trunk was starting to swap in my test env).

  * New gets better and better RAM efficiency the more RAM you give
    it.  This makes sense: the more docs are merged in RAM before
    having to flush to disk, the better it can compress the terms
    dict.  I would also expect this curve to be somewhat
    content-dependent.



Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

eks dev
wow, impressive numbers, congrats !

----- Original Message ----
From: Michael McCandless (JIRA) <[hidden email]>
To: [hidden email]
Sent: Thursday, 5 April, 2007 3:22:32 PM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents


    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486942 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------


OK I ran old (trunk) vs new (this patch) with increasing RAM buffer
sizes up to 96 MB.

I used the "normal" sized docs (~5,500 bytes plain text), left stored
fields and term vectors (positions + offsets) on, and
autoCommit=false.

Here're the results:

NUM THREADS = 1
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)


1 MB

  old
    200000 docs in 862.2 secs
    index size = 1.7G

  new
    200000 docs in 297.1 secs
    index size = 1.7G

  Total Docs/sec:             old   232.0; new   673.2 [  190.2% faster]
  Docs/MB @ flush:            old    47.2; new   278.4 [  489.6% more]
  Avg RAM used (MB) @ flush:  old    34.5; new     3.4 [   90.1% less]



2 MB

  old
    200000 docs in 828.7 secs
    index size = 1.7G

  new
    200000 docs in 279.0 secs
    index size = 1.7G

  Total Docs/sec:             old   241.3; new   716.8 [  197.0% faster]
  Docs/MB @ flush:            old    47.0; new   322.4 [  586.7% more]
  Avg RAM used (MB) @ flush:  old    37.9; new     4.5 [   88.0% less]



4 MB

  old
    200000 docs in 840.5 secs
    index size = 1.7G

  new
    200000 docs in 260.8 secs
    index size = 1.7G

  Total Docs/sec:             old   237.9; new   767.0 [  222.3% faster]
  Docs/MB @ flush:            old    46.8; new   363.1 [  675.4% more]
  Avg RAM used (MB) @ flush:  old    33.9; new     6.5 [   80.9% less]



8 MB

  old
    200000 docs in 678.8 secs
    index size = 1.7G

  new
    200000 docs in 248.8 secs
    index size = 1.7G

  Total Docs/sec:             old   294.6; new   803.7 [  172.8% faster]
  Docs/MB @ flush:            old    46.8; new   392.4 [  739.1% more]
  Avg RAM used (MB) @ flush:  old    60.3; new    10.7 [   82.2% less]



16 MB

  old
    200000 docs in 660.6 secs
    index size = 1.7G

  new
    200000 docs in 247.3 secs
    index size = 1.7G

  Total Docs/sec:             old   302.8; new   808.7 [  167.1% faster]
  Docs/MB @ flush:            old    46.7; new   415.4 [  788.8% more]
  Avg RAM used (MB) @ flush:  old    47.1; new    19.2 [   59.3% less]



24 MB

  old
    200000 docs in 658.1 secs
    index size = 1.7G

  new
    200000 docs in 243.0 secs
    index size = 1.7G

  Total Docs/sec:             old   303.9; new   823.0 [  170.8% faster]
  Docs/MB @ flush:            old    46.7; new   430.9 [  822.2% more]
  Avg RAM used (MB) @ flush:  old    70.0; new    27.5 [   60.8% less]



32 MB

  old
    200000 docs in 714.2 secs
    index size = 1.7G

  new
    200000 docs in 239.2 secs
    index size = 1.7G

  Total Docs/sec:             old   280.0; new   836.0 [  198.5% faster]
  Docs/MB @ flush:            old    46.7; new   432.2 [  825.2% more]
  Avg RAM used (MB) @ flush:  old    92.5; new    36.7 [   60.3% less]



48 MB

  old
    200000 docs in 640.3 secs
    index size = 1.7G

  new
    200000 docs in 236.0 secs
    index size = 1.7G

  Total Docs/sec:             old   312.4; new   847.5 [  171.3% faster]
  Docs/MB @ flush:            old    46.7; new   438.5 [  838.8% more]
  Avg RAM used (MB) @ flush:  old   138.9; new    52.8 [   62.0% less]



64 MB

  old
    200000 docs in 649.3 secs
    index size = 1.7G

  new
    200000 docs in 238.3 secs
    index size = 1.7G

  Total Docs/sec:             old   308.0; new   839.3 [  172.5% faster]
  Docs/MB @ flush:            old    46.7; new   441.3 [  844.7% more]
  Avg RAM used (MB) @ flush:  old   302.6; new    72.7 [   76.0% less]



80 MB

  old
    200000 docs in 670.2 secs
    index size = 1.7G

  new
    200000 docs in 227.2 secs
    index size = 1.7G

  Total Docs/sec:             old   298.4; new   880.5 [  195.0% faster]
  Docs/MB @ flush:            old    46.7; new   446.2 [  855.2% more]
  Avg RAM used (MB) @ flush:  old   231.7; new    94.3 [   59.3% less]



96 MB

  old
    200000 docs in 683.4 secs
    index size = 1.7G

  new
    200000 docs in 226.8 secs
    index size = 1.7G

  Total Docs/sec:             old   292.7; new   882.0 [  201.4% faster]
  Docs/MB @ flush:            old    46.7; new   448.0 [  859.1% more]
  Avg RAM used (MB) @ flush:  old   274.5; new   112.7 [   59.0% less]


Some observations:

  * Remember the test is already biased against "new" because with the
    patch you get an optimized index in the end but with "old" you
    don't.

  * Sweet spot for old (trunk) seems to be 48 MB: that is the peak
    docs/sec @ 312.4.

  * New (with patch) seems to just get faster the more memory you give
    it, though gradually.  The peak was 96 MB (the largest I ran), so
    no clear sweet spot yet (or maybe I need to give it more memory,
    but above 96 MB my test env was starting to swap).

  * New gets better and better RAM efficiency the more RAM you give it.
    This makes sense: the more docs are merged in RAM before having to
    flush to disk, the better it can compress the terms dict.  I would
    also expect this curve to be somewhat content dependent.
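
As a side note, the bracketed percentages in the tables above follow directly from the raw numbers; here is a minimal sketch of the arithmetic (class and helper names are illustrative, not part of Lucene):

```java
// Sketch of how the summary lines above are derived from the raw timings.
public class BenchMath {

    // Throughput for one run: total docs divided by elapsed seconds.
    public static double docsPerSec(int docs, double secs) {
        return docs / secs;
    }

    // "[ X% faster]": relative speedup of the new rate over the old rate.
    public static double pctFaster(double oldRate, double newRate) {
        return (newRate / oldRate - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Numbers from the 1 MB run: 200000 docs in 862.2 secs (old)
        // vs 297.1 secs (new).
        double oldRate = docsPerSec(200000, 862.2); // ~232.0
        double newRate = docsPerSec(200000, 297.1); // ~673.2
        System.out.printf("Total Docs/sec: old %.1f; new %.1f [%.1f%% faster]%n",
                oldRate, newRate, pctFaster(oldRate, newRate));
    }
}
```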


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
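
To make that last point concrete, here is a minimal sketch of what flushing by RAM usage (rather than by a fixed doc count) looks like; the class and method names are hypothetical, not Lucene's actual internals:

```java
// Illustrative sketch (not Lucene's real internals): trigger a flush when the
// estimated RAM held by buffered postings/term infos reaches the configured
// buffer size, instead of after a fixed number of added documents.
public class RamFlushPolicy {
    private final long ramBufferBytes;
    private long bytesUsed;
    private int flushCount;

    public RamFlushPolicy(double ramBufferMB) {
        this.ramBufferBytes = (long) (ramBufferMB * 1024 * 1024);
    }

    // Called per added document with that doc's estimated in-RAM footprint.
    // Only postings and term infos count: stored fields and term vectors are
    // written straight to disk, so they never occupy this buffer.
    public void addDocument(long estimatedBytes) {
        bytesUsed += estimatedBytes;
        if (bytesUsed >= ramBufferBytes) {
            flush();
        }
    }

    private void flush() {
        // In the real patch: write buffered postings to a new segment
        // (autoCommit=true) or to tmp files merged at close (autoCommit=false).
        flushCount++;
        bytesUsed = 0;
    }

    public int flushes() {
        return flushCount;
    }

    public static void main(String[] args) {
        RamFlushPolicy policy = new RamFlushPolicy(16.0); // 16 MB buffer
        for (int doc = 0; doc < 2000; doc++) {
            policy.addDocument(40 * 1024); // assume ~40 KB of postings per doc
        }
        System.out.println("flushes: " + policy.flushes());
    }
}
```

With these assumed sizes (a 16 MB buffer, ~40 KB of buffered postings per document) the policy flushes roughly every 410 documents; the per-document figure is made up for illustration, while the real patch tracks actual buffer usage.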

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]






               



Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless-2

"eks dev" <[hidden email]> wrote:
> wow, impressive numbers, congrats !

Thanks!  But remember many Lucene apps won't see these speedups, since I've
carefully minimized the cost of tokenization and of document retrieval in this
test.  I think for many Lucene apps these are a sizable part of the time spent
indexing.

Next up I'm going to test thread concurrency of new vs old.

And then still a fair number of things to resolve before committing...

Mike



Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Otis Gospodnetic-2
In reply to this post by Kenneth William Krugler (Jira)
Quick question, Mike:

You talk about a RAM buffer from 1 MB to 96 MB, but then you also report the amount of RAM used @ flush time (e.g. Avg RAM used (MB) @ flush:  old    34.5; new     3.4 [   90.1% less]).

I don't follow 100% of what you are doing in LUCENE-843, so could you please explain what these two different amounts of RAM are?
Is the first (1-96 MB) the RAM you use for in-memory merging of segments?
What is the RAM used @ flush?  More precisely, why does that amount of RAM exceed the RAM buffer?

Thanks,
Otis



----- Original Message ----
From: Michael McCandless (JIRA) <[hidden email]>
To: [hidden email]
Sent: Thursday, April 5, 2007 9:22:32 AM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents


    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486942 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------


OK I ran old (trunk) vs new (this patch) with increasing RAM buffer
sizes up to 96 MB.

I used the "normal" sized docs (~5,500 bytes plain text), left stored
fields and term vectors (positions + offsets) on, and
autoCommit=false.

Here're the results:

NUM THREADS = 1
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)


1 MB

  old
    200000 docs in 862.2 secs
    index size = 1.7G

  new
    200000 docs in 297.1 secs
    index size = 1.7G

  Total Docs/sec:             old   232.0; new   673.2 [  190.2% faster]
  Docs/MB @ flush:            old    47.2; new   278.4 [  489.6% more]
  Avg RAM used (MB) @ flush:  old    34.5; new     3.4 [   90.1% less]



2 MB

  old
    200000 docs in 828.7 secs
    index size = 1.7G

  new
    200000 docs in 279.0 secs
    index size = 1.7G

  Total Docs/sec:             old   241.3; new   716.8 [  197.0% faster]
  Docs/MB @ flush:            old    47.0; new   322.4 [  586.7% more]
  Avg RAM used (MB) @ flush:  old    37.9; new     4.5 [   88.0% less]



4 MB

  old
    200000 docs in 840.5 secs
    index size = 1.7G

  new
    200000 docs in 260.8 secs
    index size = 1.7G

  Total Docs/sec:             old   237.9; new   767.0 [  222.3% faster]
  Docs/MB @ flush:            old    46.8; new   363.1 [  675.4% more]
  Avg RAM used (MB) @ flush:  old    33.9; new     6.5 [   80.9% less]



8 MB

  old
    200000 docs in 678.8 secs
    index size = 1.7G

  new
    200000 docs in 248.8 secs
    index size = 1.7G

  Total Docs/sec:             old   294.6; new   803.7 [  172.8% faster]
  Docs/MB @ flush:            old    46.8; new   392.4 [  739.1% more]
  Avg RAM used (MB) @ flush:  old    60.3; new    10.7 [   82.2% less]



16 MB

  old
    200000 docs in 660.6 secs
    index size = 1.7G

  new
    200000 docs in 247.3 secs
    index size = 1.7G

  Total Docs/sec:             old   302.8; new   808.7 [  167.1% faster]
  Docs/MB @ flush:            old    46.7; new   415.4 [  788.8% more]
  Avg RAM used (MB) @ flush:  old    47.1; new    19.2 [   59.3% less]



24 MB

  old
    200000 docs in 658.1 secs
    index size = 1.7G

  new
    200000 docs in 243.0 secs
    index size = 1.7G

  Total Docs/sec:             old   303.9; new   823.0 [  170.8% faster]
  Docs/MB @ flush:            old    46.7; new   430.9 [  822.2% more]
  Avg RAM used (MB) @ flush:  old    70.0; new    27.5 [   60.8% less]





Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Chris Hostetter-3
In reply to this post by Michael McCandless-2

: Thanks!  But remember many Lucene apps won't see these speedups since I've
: carefully minimized cost of tokenization and cost of document retrieval.  I
: think for many Lucene apps these are a sizable part of time spent indexing.

true, but as long as the changes you are making have no impact on the
tokenization/doc-building times, that shouldn't be a factor -- it should
be considered a "constant time" adjunct to the code you are varying ...
people with expensive analysis may not see any significant speedup, but
that's their own problem -- people concerned about performance will
already have that as fast as they can get it, and now the internals of
document adding will get faster as well.



-Hoss


