[jira] Created: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Mike Klaas
On 4/5/07, Chris Hostetter <[hidden email]> wrote:

>
> : Thanks!  But remember many Lucene apps won't see these speedups since I've
> : carefully minimized cost of tokenization and cost of document retrieval.  I
> : think for many Lucene apps these are a sizable part of time spent indexing.
>
> true, but as long as the changes you are making have no impact on the
> tokenization/docbuilding times, that shouldn't be a factor -- that should
> be considered a "constant time" adjunct to the code you are varying ...
> people with expensive analysis may not see any significant increases, but
> that's their own problem -- people concerned about performance will
> already have that as fast as they can get it, and now the internals of
> document adding will get faster as well.

Especially since it is relatively easy for users to tweak the analysis
bits for performance--compared to the messy guts of index creation.

I am eagerly tracking the progress of your work.

-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless-2

"Mike Klaas" <[hidden email]> wrote:

> On 4/5/07, Chris Hostetter <[hidden email]> wrote:
> >
> > : Thanks!  But remember many Lucene apps won't see these speedups since I've
> > : carefully minimized cost of tokenization and cost of document retrieval.  I
> > : think for many Lucene apps these are a sizable part of time spent indexing.
> >
> > true, but as long as the changes you are making have no impact on the
> > tokenization/docbuilding times, that shouldn't be a factor -- that should
> > be considered a "constant time" adjunct to the code you are varying ...
> > people with expensive analysis may not see any significant increases, but
> > that's their own problem -- people concerned about performance will
> > already have that as fast as they can get it, and now the internals of
> > document adding will get faster as well.
>
> Especially since it is relatively easy for users to tweak the analysis
> bits for performance--compared to the messy guts of index creation.
>
> I am eagerly tracking the progress of your work.

Thanks Mike (and Hoss).

Hoss, what you said is correct: I'm only affecting the actual indexing of
a document, nothing before that.

I just want to make sure I get that disclaimer out, as much as possible, so
nobody tries the patch and says "Hey!  My app only got 10% faster!  This was
false advertising!".

People who indeed have minimized their doc retrieval and tokenization time
should see speedups around what I'm seeing with the benchmarks (I hope!).

Mike


Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless-2
In reply to this post by Otis Gospodnetic-2

Hi Otis!

"Otis Gospodnetic" <[hidden email]> wrote:
> You talk about a RAM buffer from 1MB - 96MB, but then you have the amount
> of RAM @ flush time (e.g. Avg RAM used (MB) @ flush:  old    34.5; new  
>  3.4 [   90.1% less]).
>
> I don't follow 100% of what you are doing in LUCENE-843, so could you
> please explain what these 2 different amounts of RAM are?
> Is the first (1-96) the RAM you use for in-memory merging of segments?
> What is the RAM used @ flush?  More precisely, why does that amount
> of RAM exceed the RAM buffer?

Very good questions!

When I say "the RAM buffer size is set to 96 MB", what I mean is I
flush the writer when the in-memory segments are using 96 MB RAM.  On
trunk, I just call ramSizeInBytes().  I do the analogous thing with my
patch (sum up size of RAM buffers used by segments).  I call this part
of the RAM usage the "indexed documents RAM".  With every added
document, this grows.
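The flush trigger described here can be sketched in miniature.  `RamBufferSketch` and its method names are hypothetical stand-ins for illustration only; on the trunk the actual check is `IndexWriter.ramSizeInBytes()` compared against a caller-chosen limit:

```java
// Sketch of flushing by RAM usage rather than by document count.
// Class and method names are illustrative, not Lucene's real API.
public class RamBufferSketch {
    private long bytesUsed = 0;        // analogous to ramSizeInBytes()
    private final long ramBufferSize;  // e.g. 96 MB in the benchmarks
    private int flushCount = 0;

    public RamBufferSketch(long ramBufferSizeBytes) {
        this.ramBufferSize = ramBufferSizeBytes;
    }

    /** Buffer one document's postings; flush when the threshold is crossed. */
    public void addDocument(long estimatedDocBytes) {
        bytesUsed += estimatedDocBytes;
        if (bytesUsed >= ramBufferSize) {
            flush();
        }
    }

    private void flush() {
        // In the real writer this would write out the buffered segments;
        // here we only reset the accounting.
        bytesUsed = 0;
        flushCount++;
    }

    public int getFlushCount() { return flushCount; }
    public long getBytesUsed() { return bytesUsed; }
}
```

The point of the sketch is that the "indexed documents RAM" grows monotonically with each added document until a flush resets it.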

But: this does not account for all data structures (Posting instances,
HashMap, FieldsWriter, TermVectorsWriter, int[] arrays, etc.) used,
but not saved away, during the indexing of a single document.  All the
"things" used temporarily while indexing a document take up RAM too.
I call this part of the RAM usage the "document processing RAM".  This
RAM does not grow with every added document, though its size is
proportional to how big each document is.  This memory is always
re-used.  But with the trunk, that re-use is done by creating garbage
for the collector to reclaim, whereas with my patch, I explicitly
recycle it.

When I measure "amount of RAM @ flush time", I'm calling
MemoryMXBean.getHeapMemoryUsage().getUsed().  So, this measures actual
process memory usage which should be (for my tests) around the sum of
the above two types of RAM usage.
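The measurement itself is the standard JDK 1.5 management API mentioned above; a minimal wrapper (class name invented) looks like:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapAtFlush {
    /** Samples the JVM's currently used heap, as done in the benchmarks. */
    public static long usedHeapBytes() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        return bean.getHeapMemoryUsage().getUsed();
    }
}
```

Note this reports whole-heap usage, so it captures both the "indexed documents RAM" and the "document processing RAM" (plus any garbage not yet collected), which is exactly why the trunk's numbers are noisy.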

With the trunk, the actual process memory usage tends to be quite a
bit higher than the RAM buffer size and also tends to be very "noisy"
(jumps around with each flush).  I think this is because of
delays/unpredictability on when GC kicks in to reclaim the garbage
created during indexing of the doc.  Whereas with my patch, it's
usually quite a bit closer to the "indexed documents RAM" and does not
jump around nearly as much.

So the "actual process RAM used" will always exceed my "RAM buffer
size".  The amount of excess is a measure of the "overhead" required
to process the document.  The trunk has far worse overhead than my
patch, which I think means a given application will be able to use a
*larger* RAM buffer size with LUCENE-843.

Does that make sense?

Mike


Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Grant Ingersoll-4

Michael, like everyone else, I am watching this very closely.  So far  
it sounds great!

On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote:

> When I measure "amount of RAM @ flush time", I'm calling
> MemoryMXBean.getHeapMemoryUsage().getUsed().  So, this measures actual
> process memory usage which should be (for my tests) around the sum of
> the above two types of RAM usage.

One thing caught my eye, though, MemoryMXBean is JDK 1.5.  :-(

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/management/MemoryMXBean.html


Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael McCandless-2
"Grant Ingersoll" <[hidden email]> wrote:

>
> Michael, like everyone else, I am watching this very closely.  So far  
> it sounds great!
>
> On Apr 5, 2007, at 8:03 PM, Michael McCandless wrote:
>
> > When I measure "amount of RAM @ flush time", I'm calling
> > MemoryMXBean.getHeapMemoryUsage().getUsed().  So, this measures actual
> > process memory usage which should be (for my tests) around the sum of
> > the above two types of RAM usage.
>
> One thing caught my eye, though, MemoryMXBean is JDK 1.5.  :-(
>
> http://java.sun.com/j2se/1.5.0/docs/api/java/lang/management/MemoryMXBean.html

Yeah, thanks for pointing this out.  I'm only using that to do my
benchmarking, not to actually measure RAM usage for "when to flush",
so I will definitely remove it before committing (I always go to a
1.4.2 environment and do an "ant clean test" to be certain I didn't do
something like this by accident :).

Mike


Re: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Otis Gospodnetic-2
In reply to this post by Akash (Jira)
Mike - thanks for the explanation, it makes perfect sense!

Otis



[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take5.patch

I attached a new iteration of the patch.  It's quite different from
the last patch.

After discussion on java-dev last time, I decided to retry the
"persistent hash" approach, where the Postings hash lasts across many
docs and then a single flush produces a partial segment containing all
of those docs.  This is in contrast to the previous approach where
each doc makes its own segment and then they are merged.

It turns out this is even faster than my previous approach, especially
for smaller docs and especially when term vectors are off (because no
quicksort() is needed until the segment is flushed).  I will attach
new benchmark results.

Other changes:

  * Changed my benchmarking tool / testing (IndexLineFiles):

    - I turned off compound file (to reduce time NOT spent on
      indexing).

    - I noticed I was not downcasing the terms, so I fixed that.

    - I now do my own line processing to reduce GC cost of
      "BufferedReader.readLine" (to reduce time NOT spent on
      indexing).

  * Norms now properly flush to disk in the autoCommit=false case

  * All unit tests pass except disk full

  * I turned on asserts for unit tests (jvm arg -ea added to junit ant
    task).  I think we should use asserts when running tests.  I have
    quite a few asserts now.

With this new approach, as I process each term in the document I
immediately write the prox/freq in their compact (vints) format into
shared byte[] buffers, rather than accumulating int[] arrays that then
need to be re-processed into the vint encoding.  This speeds things up
because we don't double-process the postings.  It also uses less
per-document RAM overhead because intermediate postings are stored as
vints not as ints.
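The compact vint form referred to above is Lucene's VInt: 7 bits of payload per byte, low-order byte first, with the high bit as a continuation flag.  A sketch of writing one into a shared byte[] buffer (the class name is invented):

```java
public class VIntSketch {
    /**
     * Write v at buf[pos] in Lucene-style VInt form: 7 payload bits per
     * byte, low-order first, high bit set on all but the last byte.
     * Returns the next free position in the buffer.
     */
    public static int writeVInt(byte[] buf, int pos, int v) {
        while ((v & ~0x7F) != 0) {       // more than 7 bits remain
            buf[pos++] = (byte) ((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        buf[pos++] = (byte) v;           // final byte, high bit clear
        return pos;
    }
}
```

Small deltas (the common case for freq and prox values) thus take a single byte, which is where the per-document RAM savings over int[] arrays come from.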

When enough RAM is used by the Posting entries plus the byte[]
buffers, I flush them to a partial RAM segment.  When enough of these
RAM segments have accumulated I flush to a real Lucene segment
(autoCommit=true) or to on-disk partial segments (autoCommit=false)
which are then merged in the end to create a real Lucene segment.


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number documents added.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492655 ]

Marvin Humphrey commented on LUCENE-843:
----------------------------------------

How are you writing the frq data in compressed format?  This works fine
for prx data, because the deltas are all within a single doc -- but for
the freq data, the deltas are tied up in doc num deltas, so you have to
decompress it when performing merges.

To continue our discussion from java-dev...

 * I haven't been able to come up with a file format tweak that
   gets around this doc-num-delta-decompression problem to enhance the speed
   of frq data merging. I toyed with splitting off the freq from the
   doc_delta, at the price of increasing the file size in the common case of
   freq == 1, but went back to the old design.  It's not worth the size
   increase for what's at best a minor indexing speedup.
 * I've added a custom MemoryPool class to KS which grabs memory in 1 meg
   chunks, allows resizing (downwards) of only the last allocation, and can
   only release everything at once.  From one of these pools, I'm allocating
   RawPosting objects, each of which is a doc_num, a freq, the term_text, and
   the pre-packed prx data (which varies based on which Posting subclass
   created the RawPosting object).  I haven't got things 100% stable yet, but
   preliminary results seem to indicate that this technique, which is a riff
   on your persistent arrays, improves indexing speed by about 15%.
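KinoSearch's MemoryPool is C code; a rough Java analogue of the allocation discipline described above (1 MB chunks, bump allocation, shrink-last-only, release-all-at-once) might look like the following.  All names here are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a chunked bump allocator in the spirit of the MemoryPool
 *  described above.  Returns offsets into the current chunk. */
public class PoolSketch {
    private static final int CHUNK = 1 << 20;  // grab memory 1 MB at a time
    private final List<byte[]> chunks = new ArrayList<byte[]>();
    private int used = CHUNK;                  // forces first alloc to grab a chunk
    private int lastStart = -1;

    /** Reserve n bytes (n must fit in one chunk); returns the offset. */
    public int alloc(int n) {
        if (used + n > CHUNK) {                // current chunk exhausted
            chunks.add(new byte[CHUNK]);
            used = 0;
        }
        lastStart = used;
        used += n;
        return lastStart;
    }

    /** Only the most recent allocation may be resized, and only downwards. */
    public void resizeLast(int n) {
        used = lastStart + n;
    }

    /** Everything is released at once; no per-object free. */
    public void releaseAll() {
        chunks.clear();
        used = CHUNK;
        lastStart = -1;
    }

    public int chunkCount() { return chunks.size(); }
}
```

Because nothing is freed individually, allocation is a pointer bump and there is no per-object GC or malloc bookkeeping, which is the source of the reported speedup.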


[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492658 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------


> How are you writing the frq data in compressed format? This works fine
> for prx data, because the deltas are all within a single doc -- but for
> the freq data, the deltas are tied up in doc num deltas, so you have to
> decompress it when performing merges.

For each Posting I keep track of the last docID that its term occurred
in; when this differs from the current docID I record the "delta code"
that needs to be written and then I later write it with the final freq
for this document.

> * I haven't been able to come up with a file format tweak that
>   gets around this doc-num-delta-decompression problem to enhance the speed
>   of frq data merging. I toyed with splitting off the freq from the
>   doc_delta, at the price of increasing the file size in the common case of
>   freq == 1, but went back to the old design. It's not worth the size
>   increase for what's at best a minor indexing speedup.

I'm just doing the "stitching" approach here: it's only the very first
docCode (& freq when freq==1) that must be re-encoded on merging.  The
one catch is you must store the last docID of the previous segment so
you can compute the new delta at the boundary.  Then I do a raw
"copyBytes" for the remainder of the freq postings.
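The arithmetic behind that stitching can be sketched as follows, using the .frq docCode convention (docDelta shifted left one bit, low bit set when freq == 1, freq otherwise following as its own VInt).  The helper names are invented:

```java
/** Sketch of "stitching" partial segments: only the appended segment's
 *  first doc entry is re-encoded; the rest is copied verbatim. */
public class StitchSketch {
    /** Recover the docID delta from a docCode. */
    public static int deltaOf(int docCode) { return docCode >>> 1; }

    /** Low bit set means freq == 1 (no separate freq VInt follows). */
    public static boolean freqIsOne(int docCode) { return (docCode & 1) != 0; }

    /**
     * Re-encode the appended segment's first docCode.  Within its own
     * segment the first delta equals the docID itself; after renumbering
     * to newDocBase, the delta must be taken from the last docID already
     * written by the previous segment.  The freq bit is preserved.
     */
    public static int stitchFirstDocCode(int oldDocCode, int newDocBase,
                                         int lastDocIDWritten) {
        int oldDocID = oldDocCode >>> 1;
        int newDelta = (newDocBase + oldDocID) - lastDocIDWritten;
        return (newDelta << 1) | (oldDocCode & 1);
    }
}
```

Every later entry in the appended segment is already a doc-to-doc delta, so those bytes are valid as-is and can go through a raw copyBytes.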

Note that I'm only doing this for the "internal" merges (of partial
RAM segments and flushed partial segments) I do before creating a real
Lucene segment.  I haven't changed how the "normal" Lucene segment
merging works (though I think we should look into it -- I opened a
separate issue): it still re-interprets and then re-encodes all
docID/freq's.

> * I've added a custom MemoryPool class to KS which grabs memory in 1 meg
>   chunks, allows resizing (downwards) of only the last allocation, and can
>   only release everything at once. From one of these pools, I'm allocating
>    RawPosting objects, each of which is a doc_num, a freq, the term_text, and
>   the pre-packed prx data (which varies based on which Posting subclass
>   created the RawPosting object). I haven't got things 100% stable yet, but
>   preliminary results seem to indicate that this technique, which is a riff
>   on your persistent arrays, improves indexing speed by about 15%.

Fabulous!!

I think it's the custom memory management I'm doing with slices into
shared byte[] arrays for the postings that made the persistent hash
approach work well, this time around (when I had previously tried this
it was slower).



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492668 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

Results with the above patch:

RAM = 32 MB
NUM THREADS = 1
MERGE FACTOR = 10


  2000000 DOCS @ ~550 bytes plain text


    No term vectors nor stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          2000000 docs in 782.8 secs
          index size = 436M

        new
          2000000 docs in 93.4 secs
          index size = 430M

        Total Docs/sec:             old  2554.8; new 21421.1 [  738.5% faster]
        Docs/MB @ flush:            old   128.0; new  4058.0 [ 3069.6% more]
        Avg RAM used (MB) @ flush:  old   140.2; new    38.0 [   72.9% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          2000000 docs in 780.2 secs
          index size = 436M

        new
          2000000 docs in 90.6 secs
          index size = 427M

        Total Docs/sec:             old  2563.3; new 22086.8 [  761.7% faster]
        Docs/MB @ flush:            old   128.0; new  4118.4 [ 3116.7% more]
        Avg RAM used (MB) @ flush:  old   144.6; new    36.4 [   74.8% less]



    With term vectors (positions + offsets) and 2 small stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          2000000 docs in 1227.6 secs
          index size = 2.1G

        new
          2000000 docs in 559.8 secs
          index size = 2.1G

        Total Docs/sec:             old  1629.2; new  3572.5 [  119.3% faster]
        Docs/MB @ flush:            old    93.1; new  4058.0 [ 4259.1% more]
        Avg RAM used (MB) @ flush:  old   193.4; new    38.5 [   80.1% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          2000000 docs in 1229.2 secs
          index size = 2.1G

        new
          2000000 docs in 241.0 secs
          index size = 2.1G

        Total Docs/sec:             old  1627.0; new  8300.0 [  410.1% faster]
        Docs/MB @ flush:            old    93.1; new  4118.4 [ 4323.9% more]
        Avg RAM used (MB) @ flush:  old   150.5; new    38.4 [   74.5% less]



  200000 DOCS @ ~5,500 bytes plain text


    No term vectors nor stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          200000 docs in 352.2 secs
          index size = 406M

        new
          200000 docs in 86.4 secs
          index size = 403M

        Total Docs/sec:             old   567.9; new  2313.7 [  307.4% faster]
        Docs/MB @ flush:            old    83.5; new   420.0 [  402.7% more]
        Avg RAM used (MB) @ flush:  old    76.8; new    38.1 [   50.4% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          200000 docs in 399.2 secs
          index size = 406M

        new
          200000 docs in 89.6 secs
          index size = 400M

        Total Docs/sec:             old   501.0; new  2231.0 [  345.3% faster]
        Docs/MB @ flush:            old    83.5; new   422.6 [  405.8% more]
        Avg RAM used (MB) @ flush:  old    76.7; new    36.2 [   52.7% less]



    With term vectors (positions + offsets) and 2 small stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          200000 docs in 594.2 secs
          index size = 1.7G

        new
          200000 docs in 229.0 secs
          index size = 1.7G

        Total Docs/sec:             old   336.6; new   873.3 [  159.5% faster]
        Docs/MB @ flush:            old    47.9; new   420.0 [  776.9% more]
        Avg RAM used (MB) @ flush:  old   157.8; new    38.0 [   75.9% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          200000 docs in 605.1 secs
          index size = 1.7G

        new
          200000 docs in 181.3 secs
          index size = 1.7G

        Total Docs/sec:             old   330.5; new  1103.1 [  233.7% faster]
        Docs/MB @ flush:            old    47.9; new   422.6 [  782.2% more]
        Avg RAM used (MB) @ flush:  old   132.0; new    37.1 [   71.9% less]



  20000 DOCS @ ~55,000 bytes plain text


    No term vectors nor stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          20000 docs in 180.8 secs
          index size = 350M

        new
          20000 docs in 79.1 secs
          index size = 349M

        Total Docs/sec:             old   110.6; new   252.8 [  128.5% faster]
        Docs/MB @ flush:            old    25.0; new    46.8 [   87.0% more]
        Avg RAM used (MB) @ flush:  old   112.2; new    44.3 [   60.5% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          20000 docs in 180.1 secs
          index size = 350M

        new
          20000 docs in 75.9 secs
          index size = 347M

        Total Docs/sec:             old   111.0; new   263.5 [  137.3% faster]
        Docs/MB @ flush:            old    25.0; new    47.5 [   89.7% more]
        Avg RAM used (MB) @ flush:  old   111.1; new    42.5 [   61.7% less]



    With term vectors (positions + offsets) and 2 small stored fields

      AUTOCOMMIT = true (commit whenever RAM is full)

        old
          20000 docs in 323.1 secs
          index size = 1.4G

        new
          20000 docs in 183.9 secs
          index size = 1.4G

        Total Docs/sec:             old    61.9; new   108.7 [   75.7% faster]
        Docs/MB @ flush:            old    10.4; new    46.8 [  348.3% more]
        Avg RAM used (MB) @ flush:  old    74.2; new    44.9 [   39.5% less]


      AUTOCOMMIT = false (commit only once at the end)

        old
          20000 docs in 323.5 secs
          index size = 1.4G

        new
          20000 docs in 135.6 secs
          index size = 1.4G

        Total Docs/sec:             old    61.8; new   147.5 [  138.5% faster]
        Docs/MB @ flush:            old    10.4; new    47.5 [  354.8% more]
        Avg RAM used (MB) @ flush:  old    74.3; new    42.9 [   42.2% less]



> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number documents added.
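
The flush-by-RAM behavior described above can be sketched without any Lucene dependency as follows.  The class, the byte accounting, and the 1 KB threshold are all invented for illustration; the real writer estimates RAM usage far more carefully:

```java
// Toy sketch of "flush when RAM is full" instead of "flush every N docs".
import java.util.ArrayList;
import java.util.List;

public class RamFlushSketch {
    static final long RAM_BUFFER_BYTES = 1024;   // stand-in for setRAMBufferSize(...)

    long bytesUsed = 0;
    int flushCount = 0;
    final List<String> buffered = new ArrayList<>();

    void addDocument(String doc) {
        buffered.add(doc);
        bytesUsed += doc.length();               // crude RAM estimate
        if (bytesUsed >= RAM_BUFFER_BYTES) {
            flush();
        }
    }

    void flush() {
        // In Lucene this would write a segment; here we just clear the buffer.
        buffered.clear();
        bytesUsed = 0;
        flushCount++;
    }

    public static void main(String[] args) {
        RamFlushSketch w = new RamFlushSketch();
        for (int i = 0; i < 100; i++) {
            w.addDocument("x".repeat(100));      // 100 docs of ~100 bytes each
        }
        // Flush frequency depends on document size, not on a fixed doc count.
        System.out.println(w.flushCount);
    }
}
```

With uniform 100-byte documents this flushes every 11 docs; with larger documents it would flush more often, which is exactly the point of triggering on RAM usage rather than on document count.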

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492674 ]

Yonik Seeley commented on LUCENE-843:
-------------------------------------

How does this work with pending deletes?
I assume that if autocommit is false, then you need to wait until the end when you get a real lucene segment to delete the pending terms?

Also, how has the merge policy (or index invariants) of lucene segments changed?
If autocommit is off, then you wait until the end to create a big Lucene segment.  This new segment may be much larger than segments to its "left".  I suppose the idea of merging rightmost segments should just be dropped in favor of merging the smallest adjacent segments?  Sorry if this has already been covered... as I said, I'm trying to follow along at a high level.




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492748 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

> How does this work with pending deletes?
> I assume that if autocommit is false, then you need to wait until the end when you get a real lucene segment to delete the pending terms?

Yes, all of this sits "below" the pending deletes layer since this
change writes a single segment either when RAM is full
(autoCommit=true) or when writer is closed (autoCommit=false).  Then
the deletes get applied like normal (I haven't changed that part).

> Also, how has the merge policy (or index invariants) of lucene segments changed?
> If autocommit is off, then you wait until the end to create a big Lucene segment.  This new segment may be much larger than segments to its "left".  I suppose the idea of merging rightmost segments should just be dropped in favor of merging the smallest adjacent segments?  Sorry if this has already been covered... as I said, I'm trying to follow along at a high level.

Has not been covered, and as usual these are excellent questions
Yonik!

I haven't yet changed anything about merge policy, but you're right
the current invariants won't hold anymore.  In fact they already don't
hold if you "flush by RAM" now (APIs are exposed in 2.1 to let you do
this).  So we need to do something.

I like your idea to relax merge policy (& invariants) to allow
"merging of any adjacent segments" (not just rightmost ones) and then
make the policy merge the smallest ones / most similarly sized ones,
measuring size by net # bytes in the segment.  This would preserve the
"docID monotonicity invariance".

If we take that approach then it would automatically resolve
LUCENE-845 as well (which would otherwise block this issue).
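To make the relaxed policy concrete, here is a small self-contained sketch (not Lucene's actual MergePolicy API; the class and method names are invented) of choosing the smallest window of *adjacent* segments to merge, measuring size by net bytes, which preserves docID order:

```java
// Sketch of the relaxed merge policy discussed above: instead of only
// merging the rightmost segments, scan every window of adjacent segments
// and pick the one with the smallest net byte size.  Merging adjacent
// segments keeps the "docID monotonicity invariance" intact.
public class AdjacentMergeSketch {

    /** Returns the start index of the smallest window of `width` adjacent segments. */
    static int cheapestAdjacentWindow(long[] segmentBytes, int width) {
        long best = Long.MAX_VALUE;
        int bestStart = -1;
        for (int start = 0; start + width <= segmentBytes.length; start++) {
            long sum = 0;
            for (int i = start; i < start + width; i++) {
                sum += segmentBytes[i];
            }
            if (sum < best) {
                best = sum;
                bestStart = start;
            }
        }
        return bestStart;
    }

    public static void main(String[] args) {
        // One big segment on the left would never be touched by a
        // rightmost-only policy; here the small middle run is picked instead.
        long[] sizes = {1_000_000, 10, 20, 15, 500_000};
        System.out.println(cheapestAdjacentWindow(sizes, 3)); // window starting at index 1
    }
}
```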




[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take6.patch

Attached latest patch.

I'm now working towards simplifying & cleaning up the code & design:
eliminated dead code left over from the previous iterations, used the
existing RAMFile instead of my own new class, refactored
duplicate/confusing code, added comments, etc.  It's getting closer to
a committable state but still has a ways to go.

I also renamed the class from MultiDocumentWriter to DocumentsWriter.

To summarize the current design:

  1. Write stored fields & term vectors to files in the Directory
     immediately (don't buffer these in RAM).

  2. Write freq & prox postings to RAM directly as a byte stream
     instead of first pass as int[] and then second pass as a byte
     stream.  This single-pass instead of double-pass is a big
     savings.  I use slices into shared byte[] arrays to efficiently
     allocate bytes to the postings that need them.

  3. Build Postings hash that holds the Postings for many documents at
     once instead of a single doc, keyed by unique term.  Not tearing
     down & rebuilding the Postings hash w/ every doc saves a lot of
     time.  Also, when term vectors are off, this saves a quicksort for
     every doc, which gives a very good performance gain.

     When the Postings hash is full (used up the allowed RAM usage) I
     then create a real Lucene segment when autoCommit=true, else a
     "partial segment".

  4. Use my own "partial segment" format that differs from Lucene's
     normal segments in that it is optimized for merging (and unusable
     for searching).  This format, and the merger I created to work
     with this format, performs merging mostly by copying blocks of
     bytes instead of reinterpreting every vInt in each Postings list.
     These partial segments are only created when IndexWriter has
     autoCommit=false, and then on commit they are merged into the
     real Lucene segment format.

  5. Reuse the Posting, PostingVector, char[] and byte[] objects that
     are used by the Postings hash.
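
Point 2's "slices into shared byte[] arrays" can be sketched roughly as below.  This is a deliberately simplified illustration (the real DocumentsWriter uses more elaborate slice handling); all names and sizes here are invented:

```java
// Illustrative sketch of carving byte slices out of shared byte[] blocks,
// so each posting's byte stream grows without a per-posting allocation.
// Simplified: real code chains variable-size slices; here slices that
// don't fit in the current block simply start a fresh one.
import java.util.ArrayList;
import java.util.List;

public class SlicePoolSketch {
    static final int BLOCK_SIZE = 64;        // shared block size (toy value)
    final List<byte[]> blocks = new ArrayList<>();
    int upto = BLOCK_SIZE;                   // write position in current block

    /** Reserve `len` contiguous bytes; returns the start offset in the current block. */
    int alloc(int len) {
        if (upto + len > BLOCK_SIZE) {       // current block full: add a new shared block
            blocks.add(new byte[BLOCK_SIZE]);
            upto = 0;
        }
        int start = upto;
        upto += len;
        return start;
    }

    public static void main(String[] args) {
        SlicePoolSketch pool = new SlicePoolSketch();
        for (int i = 0; i < 10; i++) {
            pool.alloc(20);                  // ten 20-byte slices
        }
        // Three 20-byte slices fit per 64-byte block, so 10 slices need 4 blocks.
        System.out.println(pool.blocks.size());
    }
}
```

The point of the design is that GC pressure scales with the number of shared blocks, not with the number of postings, which also makes the buffers easy to recycle between flushes.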

I plan to keep simplifying the design & implementation.  Specifically,
I'm going to test removing #4 above entirely (using my own "partial
segment" format that's optimized for merging not searching).

While doing this may give back some of the performance gains, that
code is the source of much added complexity in the patch, and, it
duplicates the current SegmentMerger code.  It was more necessary
before (when we would merge thousands of single-doc segments in
memory) but now that each segment contains many docs I think we are no
longer gaining as much performance from it.

I plan instead to write all segments in the "real" Lucene segment
format and use the current SegmentMerger, possibly w/ some small
changes, to do the merges even when autoCommit=false.  Since we have
another issue (LUCENE-856) to optimize segment merging I can carry
over any optimizations that we may want to keep into that issue.  If
this doesn't lose much performance it will make the approach here even
simpler.




[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take7.patch

Latest working patch attached.

I've cutover to using Lucene's normal segment merging for all merging
(ie, I no longer use a different merge-efficient format for segments
when autoCommit=false); this has substantially simplified the code.

All unit tests pass except the disk-full test and certain contrib tests
(gdata-server, lucli, similarity, wordnet) that I don't think I'm
causing.

Other changes:

  * Consolidated flushing of a new segment back into IndexWriter
    (previously DocumentsWriter would do its own flushing when
    autoCommit=false).

    I would also like to consolidate merging entirely into
    IndexWriter; right now DocumentsWriter does its own merging of the
    flushed segments when autoCommit=false (this is because those
    segments are "partial" meaning they do not have their own stored
    fields or term vectors).  I'm trying to find a clean way to do
    this...

  * Thread concurrency now works: each thread writes into a separate
    Postings hash (up until a limit (currently 5) at which point the
    threads share the Postings hashes) and then when flushing the
    segment I merge the docIDs together. I flush when the total RAM
    used across threads is over the limit.  I ran a test comparing
    thread concurrency on current trunk vs this patch, which I'll post
    next.

  * Reduced bytes used per-unique-term to be lower than current
    Lucene.  This means the worst-case document (many terms, all of
    which are unique) should use less RAM overall than Lucene trunk
    does.

  * Added some new unit test cases; added missing "writer.close()" to
    one of the contrib tests.

  * Cleanup, comments, etc.  I think the code is getting more
    "approachable" now.
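
The thread-concurrency scheme in the second bullet can be sketched as follows.  This is a standalone illustration, not the patch's code; the names and the flush criterion are simplified (in particular, the real code merges the per-thread docIDs at flush time):

```java
// Sketch: each thread accumulates bytes in its own buffer, but the flush
// decision is made on the *total* RAM used across all threads.
import java.util.concurrent.atomic.AtomicLong;

public class ThreadedBufferSketch {
    static final long RAM_LIMIT = 10_000;
    static final AtomicLong totalBytes = new AtomicLong();

    // Per-thread buffer state (here just a byte count, standing in for a
    // thread-private Postings hash).
    static final ThreadLocal<long[]> perThreadBytes =
        ThreadLocal.withInitial(() -> new long[1]);

    /** Returns true once the shared RAM limit has been reached. */
    static boolean addDocument(int docBytes) {
        perThreadBytes.get()[0] += docBytes;
        return totalBytes.addAndGet(docBytes) >= RAM_LIMIT;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int t = 0; t < 4; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 50; i++) {
                    addDocument(100);        // 4 threads * 50 docs * 100 bytes
                }
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
        // The global total crosses the limit even though each thread's own
        // buffer only reached 5,000 bytes.
        System.out.println(totalBytes.get() >= RAM_LIMIT); // prints true
    }
}
```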




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502793 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

I ran a benchmark using more than 1 thread to do indexing, in order to
test & compare concurrency of trunk and the patch.  The test is the
same as above, and runs on a 4 core Mac Pro (OS X) box with 4 drive
RAID 0 IO system.

Here are the raw results:

DOCS = ~5,500 bytes plain text
RAM = 32 MB
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)

NUM THREADS = 1

        new
          200000 docs in 172.3 secs
          index size = 1.7G

        old
          200000 docs in 539.5 secs
          index size = 1.7G

        Total Docs/sec:             old   370.7; new  1161.0 [  213.2% faster]
        Docs/MB @ flush:            old    47.9; new   334.6 [  598.7% more]
        Avg RAM used (MB) @ flush:  old   131.9; new    33.1 [   74.9% less]


NUM THREADS = 2

        new
          200001 docs in 130.8 secs
          index size = 1.7G

        old
          200001 docs in 452.8 secs
          index size = 1.7G

        Total Docs/sec:             old   441.7; new  1529.3 [  246.2% faster]
        Docs/MB @ flush:            old    47.9; new   301.5 [  529.7% more]
        Avg RAM used (MB) @ flush:  old   226.1; new    35.2 [   84.4% less]


NUM THREADS = 3

        new
          200002 docs in 105.4 secs
          index size = 1.7G

        old
          200002 docs in 428.4 secs
          index size = 1.7G

        Total Docs/sec:             old   466.8; new  1897.9 [  306.6% faster]
        Docs/MB @ flush:            old    47.9; new   277.8 [  480.2% more]
        Avg RAM used (MB) @ flush:  old   289.8; new    37.0 [   87.2% less]


NUM THREADS = 4

        new
          200003 docs in 104.8 secs
          index size = 1.7G

        old
          200003 docs in 440.4 secs
          index size = 1.7G

        Total Docs/sec:             old   454.1; new  1908.5 [  320.3% faster]
        Docs/MB @ flush:            old    47.9; new   259.9 [  442.9% more]
        Avg RAM used (MB) @ flush:  old   293.7; new    37.1 [   87.3% less]


NUM THREADS = 5

        new
          200004 docs in 99.5 secs
          index size = 1.7G

        old
          200004 docs in 425.0 secs
          index size = 1.7G

        Total Docs/sec:             old   470.6; new  2010.5 [  327.2% faster]
        Docs/MB @ flush:            old    47.9; new   245.3 [  412.6% more]
        Avg RAM used (MB) @ flush:  old   390.9; new    38.3 [   90.2% less]


NUM THREADS = 6

        new
          200005 docs in 106.3 secs
          index size = 1.7G

        old
          200005 docs in 427.1 secs
          index size = 1.7G

        Total Docs/sec:             old   468.2; new  1882.3 [  302.0% faster]
        Docs/MB @ flush:            old    47.8; new   248.5 [  419.3% more]
        Avg RAM used (MB) @ flush:  old   340.9; new    38.7 [   88.6% less]


NUM THREADS = 7

        new
          200006 docs in 106.1 secs
          index size = 1.7G

        old
          200006 docs in 435.2 secs
          index size = 1.7G

        Total Docs/sec:             old   459.6; new  1885.3 [  310.2% faster]
        Docs/MB @ flush:            old    47.8; new   248.7 [  420.0% more]
        Avg RAM used (MB) @ flush:  old   408.6; new    39.1 [   90.4% less]


NUM THREADS = 8

        new
          200007 docs in 109.0 secs
          index size = 1.7G

        old
          200007 docs in 469.2 secs
          index size = 1.7G

        Total Docs/sec:             old   426.3; new  1835.2 [  330.5% faster]
        Docs/MB @ flush:            old    47.8; new   251.3 [  425.5% more]
        Avg RAM used (MB) @ flush:  old   448.9; new    39.0 [   91.3% less]



Some quick comments:

  * Both trunk & the patch show speedups if you use more than 1 thread
    to do indexing.  This is expected since the machine has concurrency.

  * The biggest speedup is from 1->2 threads but still good gains from
    2->5 threads.

  * Best seems to be 5 threads.

  * The patch allows better concurrency: relatively speaking, it speeds
    up faster than the trunk as you increase the number of threads (the
    "% faster" figure grows as we add threads).  I think this makes
    sense because we flush less often with the patch, and flushing is
    time consuming and single threaded.




[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take8.patch

Attached latest patch.

I think this patch is ready to commit.  I will let it sit for a while
so people can review it.

We still need to do LUCENE-845 before it can be committed as is.

However one option instead would be to commit this patch, but leave
IndexWriter flushing by doc count by default and then later switch it
to flush by net RAM usage once LUCENE-845 is done.  I like this option
best.

All tests pass (I've re-enabled the disk full tests and fixed error
handling so they now pass) on Windows XP, Debian Linux and OS X.

Summary of the changes in this rev:

  * Finished cleaning up & commenting code

  * Exception handling: if there is a disk full or any other exception
    while adding a document or flushing then the index is rolled back
    to the last commit point.

  * Added more unit tests

  * Removed my profiling tool from the patch (not intended to be
    committed)

  * Fixed a thread safety issue where if you flush by doc count you
    would sometimes get more than the doc count at flush than you
    requested.  I moved the thread synchronization for determining
    flush time down into DocumentsWriter.

  * Also fixed thread safety of calling flush with one thread while
    other threads are still adding documents.

  * The biggest change is: absorbed all merging logic back into
    IndexWriter.

    Previously in DocumentsWriter I was tracking my own
    flushed/partial segments and merging them on my own (but using
    SegmentMerger).  This makes DocumentsWriter much simpler: now its
    sole purpose is to gather added docs and write a new segment.

    This turns out to be a big win:

      - Code is much simpler (no duplication of "merging"
        policy/logic)

      - 21-25% additional performance gain for autoCommit=false case
        when stored fields & vectors are used

      - IndexWriter.close() no longer takes an unexpectedly long time
        to close in the autoCommit=false case

    However I had to make a change to the index format to do this.
    The basic idea is to allow multiple segments to share access to
    the "doc store" (stored fields, vectors) index files.

    The change is quite simple: FieldsReader/VectorsReader are now
    told the doc offset that they should start from when seeking in
    the index stream (this info is stored in SegmentInfo).  When
    merging segments we don't merge the "doc store" files when all
    segments are sharing the same ones (big performance gain), else,
    we make a private copy of the "doc store" files (ie as segments
    normally are on the trunk today).

    The change is fully backwards compatible (I added a test case to
    the backwards compatibility unit test to be sure) and the change
    is only used when autoCommit=false.

    When autoCommit=false, the writer will append stored fields /
    vectors to a single set of files even though it is flushing normal
    segments whenever RAM is full.  These normal segments all refer to
    the single shared set of "doc store" files.  Then when segments
    are merged, the newly merged segment has its own "private" doc
    stores again.  So the sharing only occurs for the "level 0"
    segments.

    I still need to update fileformats doc with this change.
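
The shared "doc store" idea can be sketched as below.  This is an illustration only; the record fields mirror the description above (each segment recording the doc offset at which its documents start in the shared files), but the names are invented, not Lucene's actual SegmentInfo fields:

```java
// Sketch: several level-0 segments, flushed while autoCommit=false, share
// one set of stored-fields/vectors files.  Each segment records the doc
// offset where its documents begin, so a reader seeks to (offset + localDoc).
public class DocStoreSketch {

    // Per-segment metadata (names illustrative).
    record SegmentInfo(String name, int docCount, int docStoreOffset) {}

    public static void main(String[] args) {
        // Offsets are simply the cumulative doc counts of earlier segments.
        SegmentInfo s0 = new SegmentInfo("_0", 100, 0);
        SegmentInfo s1 = new SegmentInfo("_1", 250, 100);
        SegmentInfo s2 = new SegmentInfo("_2", 50, 350);

        // Reading stored fields for local doc 10 of segment _2 seeks to the
        // global doc position in the shared doc store files:
        int globalDoc = s2.docStoreOffset() + 10;
        System.out.println(globalDoc); // prints 360
    }
}
```

This is why merging can skip the doc store files when all input segments share the same ones: the merged segment just keeps the files and adjusts offsets, instead of copying every stored field.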




[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505373 ]

Yonik Seeley commented on LUCENE-843:
-------------------------------------

> When merging segments we don't merge the "doc store" files when all segments are sharing the same ones (big performance gain),

Is this only in the case where the segments have no deleted docs?


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch, LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, e.g. how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
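The basic ideas listed above amount to a flush policy driven by RAM usage rather than document count. Here is a minimal, self-contained sketch of that idea; all class and method names are hypothetical illustrations, not Lucene's actual internals, and the RAM accounting is deliberately crude:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of flushing by RAM usage instead of doc count.
class RamBufferedWriter {
    private final long ramBufferBytes;          // flush threshold in bytes
    private long bytesUsed = 0;                 // RAM held by buffered postings
    private int flushCount = 0;                 // segments flushed so far
    private final List<String> buffered = new ArrayList<>();

    RamBufferedWriter(long ramBufferBytes) {
        this.ramBufferBytes = ramBufferBytes;
    }

    void addDocument(String doc) {
        // Stored fields / term vectors would go straight to disk here;
        // only postings & term infos count against the RAM buffer.
        buffered.add(doc);
        bytesUsed += doc.length() * 2L;         // crude 2-bytes-per-char estimate
        if (bytesUsed >= ramBufferBytes) {
            flush();
        }
    }

    void flush() {
        // Merge the in-RAM postings, write a segment, recycle the buffers.
        buffered.clear();
        bytesUsed = 0;
        flushCount++;
    }

    int flushCount() { return flushCount; }
}
```

The point of the model: flush frequency now adapts to how large the documents actually are, so small documents buffer in bigger batches and huge documents don't exhaust the heap.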


[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505418 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

> > When merging segments we don't merge the "doc store" files when all segments are sharing the same ones (big performance gain),
>
> Is this only in the case where the segments have no deleted docs?

Right.  Also, the segments must be contiguous, which the current merge
policy ensures but future merge policies may not.
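The condition Michael describes has three parts: every segment in the merge must share the same doc-store files, none may have deletions, and the segments must be contiguous within that store. A self-contained sketch of that check follows; the classes and fields are illustrative stand-ins, not Lucene's actual SegmentInfo API:

```java
import java.util.List;

// Hypothetical stand-in for a segment's doc-store bookkeeping.
class SegInfo {
    final String docStoreName;   // which shared doc-store files this segment uses
    final int docStoreOffset;    // this segment's start offset in that store
    final int docCount;
    final boolean hasDeletions;

    SegInfo(String store, int offset, int count, boolean dels) {
        this.docStoreName = store;
        this.docStoreOffset = offset;
        this.docCount = count;
        this.hasDeletions = dels;
    }
}

class MergeCheck {
    // True only when every segment uses the same doc store, none has
    // deletions, and the segments are contiguous within that store.
    static boolean canShareDocStore(List<SegInfo> segs) {
        String store = segs.get(0).docStoreName;
        int expectedOffset = segs.get(0).docStoreOffset;
        for (SegInfo s : segs) {
            if (!store.equals(s.docStoreName) || s.hasDeletions
                    || s.docStoreOffset != expectedOffset) {
                return false;
            }
            expectedOffset += s.docCount;
        }
        return true;
    }
}
```

When the check passes, the merged segment can simply reference the existing doc-store files instead of rewriting stored fields and term vectors, which is where the big performance gain comes from.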



[jira] Updated: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-843:
--------------------------------------

    Attachment: LUCENE-843.take9.patch

OK, I attached a new version (take9) of the patch that reverts to
the default of "flush after every 10 documents added" in IndexWriter.
This removes the dependency on LUCENE-845.

However, I still think we should later (once LUCENE-845 is done)
default IndexWriter to flush by RAM usage since this will generally
give the best "out of the box" performance.  I will open a separate
issue to change the default after this issue is resolved.
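The take9 change leaves two mutually exclusive flush triggers: the restored default of flushing every 10 added documents, and the opt-in RAM-based flushing from setRAMBufferSize. A small self-contained model of that switch (names and semantics are my reading of this issue, not Lucene's actual API):

```java
// Illustrative model of the two flush triggers discussed in this issue:
// the take9 default of "flush every 10 added docs" vs. flushing by RAM usage.
class FlushPolicy {
    static final int DISABLED = -1;
    int maxBufferedDocs = 10;         // take9 default: count-based flushing
    long ramBufferBytes = DISABLED;   // RAM-based flushing off by default

    // Switching to RAM-based flushing disables the doc-count trigger,
    // mirroring how setRAMBufferSize is described in this issue.
    void setRamBufferBytes(long bytes) {
        ramBufferBytes = bytes;
        maxBufferedDocs = DISABLED;
    }

    boolean shouldFlush(int bufferedDocs, long bytesUsed) {
        if (maxBufferedDocs != DISABLED && bufferedDocs >= maxBufferedDocs) {
            return true;
        }
        return ramBufferBytes != DISABLED && bytesUsed >= ramBufferBytes;
    }
}
```

This also shows why a RAM-based default (the follow-up issue Michael mentions) would give better out-of-the-box performance: a fixed doc count of 10 flushes far too often for small documents.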



[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Akash (Jira)
In reply to this post by Akash (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506576 ]

Steven Parkes commented on LUCENE-843:
--------------------------------------

I've started looking at this, specifically at what it would take to merge it with the merge policy work (LUCENE-847). I noticed that there are a couple of test failures?

