[jira] Created: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

classic Classic list List threaded Threaded
34 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
Improve indexing performance by increasing internal buffer sizes
----------------------------------------------------------------

                 Key: LUCENE-888
                 URL: https://issues.apache.org/jira/browse/LUCENE-888
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 2.1
            Reporter: Michael McCandless
         Assigned To: Michael McCandless
            Priority: Minor


In working on LUCENE-843, I noticed that two buffer sizes have a
substantial impact on overall indexing performance.

First is BufferedIndexOutput.BUFFER_SIZE (also used by
BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
actually build the compound file.  Both are now 1 KB (1024 bytes).

I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
~5,500 byte plain text docs derived from the Europarl corpus
(English).  I index 200,000 docs with compound file enabled and term
vector positions & offsets stored plus stored fields.  I flush
documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
optimized in the end and I left mergeFactor @ 10.

I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
system.

At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
I increase both buffers to 8 KB it takes 554 sec to build the index,
which is an 11% overall gain!

I will run more tests to see if there is a natural knee in the curve
(buffer size above which we don't really gain much more performance).

I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
at 1024, at least for now.  During searching there can be quite a few
of this class instantiated, and likely a larger buffer size for the
freq/prox streams could actually hurt search performance for those
searches that use skipping.

The CompoundFileWriter buffer is created only briefly, so I think we
can use a fairly large (32 KB?) buffer there.  And there should not be
too many BufferedIndexOutputs alive at once so I think a large-ish
buffer (16 KB?) should be OK.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498652 ]

Michael Busch commented on LUCENE-888:
--------------------------------------

> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!

Cool!

> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.

I submitted a patch for LUCENE-430 which avoids copying the buffer when
a BufferedIndexInput is cloned. With this patch we could also add a
method setBufferSize(int) to BufferedIndexInput. This method has to
be called then right after the input has been created or cloned and
before the first read is performed (the first read operation allocates
the buffer). If called later it wouldn't have any effect. This would
allow us to adjust the buffer size dynamically, e. g. use large buffers
for segment merges and stored fields, but smaller ones for freq/prox
streams, maybe dependent on the document frequency.
What do you think?

> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

I'm wondering how much performance benefits if you increase the buffer
size beyond the file system's page size? Does it make a big difference
if you use 8 KB, 16 KB or 32 KB? If the answer is yes, then I think
the numbers you propose are good.

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498696 ]

Michael McCandless commented on LUCENE-888:
-------------------------------------------

> > I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> > at 1024, at least for now.  During searching there can be quite a few
> > of this class instantiated, and likely a larger buffer size for the
> > freq/prox streams could actually hurt search performance for those
> > searches that use skipping.
>
> I submitted a patch for LUCENE-430 which avoids copying the buffer when
> a BufferedIndexInput is cloned. With this patch we could also add a
> method setBufferSize(int) to BufferedIndexInput. This method has to
> be called then right after the input has been created or cloned and
> before the first read is performed (the first read operation allocates
> the buffer). If called later it wouldn't have any effect. This would
> allow us to adjust the buffer size dynamically, e. g. use large buffers
> for segment merges and stored fields, but smaller ones for freq/prox
> streams, maybe dependent on the document frequency.
> What do you think?

I like that idea!

I am actually seeing that increased buffer sizes for
BufferedIndexInput help performance of indexing as well (up to ~5%
just changing this buffer), so I think we do want to increase this but
only for merging.

I wonder if we should just add a ctor to BufferedIndexInput that takes
the bufferSize?  This would avoid the surprising API caveat you
describe above.  The problem is, then all classes (SegmentTermDocs,
SegmentTermPositions, FieldsReader, etc.) that open an IndexInput
would also have to have ctors to change buffer sizes.  Even if we do
setBufferSize instead of new ctor we have some cases (eg at least
SegmentTermEnum) where bytes are read during construction so it's too
late for caller to then change buffer size.  Hmmm.  Not clear how to
do this cleanly...

Maybe we do the setBufferSize approach, but, if the buffer already
exists, rather than throwing an exception we check if the new size is
greater than the old size and if so we grow the buffer?  I can code this
up.

> > The CompoundFileWriter buffer is created only briefly, so I think we
> > can use a fairly large (32 KB?) buffer there.  And there should not be
> > too many BufferedIndexOutputs alive at once so I think a large-ish
> > buffer (16 KB?) should be OK.
>
> I'm wondering how much performance benefits if you increase the buffer
> size beyond the file system's page size? Does it make a big difference
> if you use 8 KB, 16 KB or 32 KB? If the answer is yes, then I think
> the numbers you propose are good.

I'm testing now different sizes of each of these three buffers
... will post the results.


> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498714 ]

Michael Busch commented on LUCENE-888:
--------------------------------------

> I wonder if we should just add a ctor to BufferedIndexInput that takes
> the bufferSize? This would avoid the surprising API caveat you
> describe above. The problem is, then all classes (SegmentTermDocs,
> SegmentTermPositions, FieldsReader, etc.) that open an IndexInput
> would also have to have ctors to change buffer sizes. Even if we do
> setBufferSize instead of new ctor we have some cases (eg at least
> SegmentTermEnum) where bytes are read during construction so it's too
> late for caller to then change buffer size. Hmmm. Not clear how to
> do this cleanly...

Yeah I was thinking about the ctor approach as well. Actually
BufferedIndexInput does not have a public ctor so far, it's created by
using Directory.openInput(String fileName). And to add a new ctor would
mean an API change, so subclasses wouldn't compile anymore without
changes.
What me might want to do instead is to add a new new method
openInput(String fileName, int bufferSize) to Directory which calls
the existing openInput(String fileName) by default, so subclasses of
Directory would ignore the bufferSize parameter by default. Then we
can change FSDirectory to overwrite openInput(String, int):

  public IndexInput openInput(String name, int bufferSize)
                throws IOException {
    FSIndexInput input = new FSIndexInput(new File(directory, name));
        input.setBufferSize(bufferSize);
        return input;
  }

This should solve the problems you mentioned like in SegmentTermEnum
and we don't have to support setBufferSize() after a read has been
performed. It has also the advantage that we safe an instanceof and
cast from IndexInput to BufferedIndexInput before setBufferSize()
can be called.

After a clone however, we would still have to cast to
BufferedIndexInput before setBufferSize() can be called.

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498739 ]

Michael McCandless commented on LUCENE-888:
-------------------------------------------

> > I wonder if we should just add a ctor to BufferedIndexInput that takes
> > the bufferSize? This would avoid the surprising API caveat you
> > describe above. The problem is, then all classes (SegmentTermDocs,
> > SegmentTermPositions, FieldsReader, etc.) that open an IndexInput
> > would also have to have ctors to change buffer sizes. Even if we do
> > setBufferSize instead of new ctor we have some cases (eg at least
> > SegmentTermEnum) where bytes are read during construction so it's too
> > late for caller to then change buffer size. Hmmm. Not clear how to
> > do this cleanly...
>
> Yeah I was thinking about the ctor approach as well. Actually
> BufferedIndexInput does not have a public ctor so far, it's created by
> using Directory.openInput(String fileName). And to add a new ctor would
> mean an API change, so subclasses wouldn't compile anymore without
> changes.

Actually, it does have a default public constructor right?  Ie if we add

  public BufferedIndexInput()
  public BufferedIndexInput(int bufferSize)

then I think we don't break API backwards compatibility?

> After a clone however, we would still have to cast to
> BufferedIndexInput before setBufferSize() can be called.

I plan to add "private int bufferSize" to BufferedIndexInput,
defaulting to BUFFER_SIZE.  I think then it would just work w/ your
LUCENE-430 patch because your patch sets the clone's buffer to null
and then when the clone allocates its buffer it will be length
bufferSize.  I think?


> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498757 ]

Michael Busch commented on LUCENE-888:
--------------------------------------

> Actually, it does have a default public constructor right? Ie if we add

>  public BufferedIndexInput()
>  public BufferedIndexInput(int bufferSize)

> then I think we don't break API backwards compatibility?

Oups! Of course, you are right. What was I thinking...

> I plan to add "private int bufferSize" to BufferedIndexInput,
> defaulting to BUFFER_SIZE. I think then it would just work w/ your
> LUCENE-430 patch because your patch sets the clone's buffer to null
> and then when the clone allocates its buffer it will be length
> bufferSize. I think?

True. But it would be nice if it was possible to change the buffer size
after a clone. For example in SegmentTermDocs we could then adjust the
buffer size of the cloned freqStream according to the document frequency.
And in my multi-level skipping patch (LUCENE-866) I could also benefit
from this functionality.

Hmm, in SegmentTermDocs the freq stream is cloned in the ctor. If the
same instance of SegmentTermDocs is used for different terms, then
the same clone is used. So actually it would be nice it was possible to
change the buffer size after read has performed.

> Maybe we do the setBufferSize approach, but, if the buffer already
> exists, rather than throwing an exception we check if the new size is
> greater than the old size and if so we grow the buffer? I can code this
> up.

So yes, I think we should implement it this way.

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498776 ]

Michael McCandless commented on LUCENE-888:
-------------------------------------------

> > I plan to add "private int bufferSize" to BufferedIndexInput,
> > defaulting to BUFFER_SIZE. I think then it would just work w/ your
> > LUCENE-430 patch because your patch sets the clone's buffer to null
> > and then when the clone allocates its buffer it will be length
> > bufferSize. I think?
>
> True. But it would be nice if it was possible to change the buffer size
> after a clone. For example in SegmentTermDocs we could then adjust the
> buffer size of the cloned freqStream according to the document frequency.
> And in my multi-level skipping patch (LUCENE-866) I could also benefit
> from this functionality.

OK, I agree: let's add a BufferedIndexInput.setBufferSize() and also
openInput(path, bufferSize) to Directory base class & to FSDirectory.

> Hmm, in SegmentTermDocs the freq stream is cloned in the ctor. If the
> same instance of SegmentTermDocs is used for different terms, then
> the same clone is used. So actually it would be nice it was possible to
> change the buffer size after read has performed.
>
> > Maybe we do the setBufferSize approach, but, if the buffer already
> > exists, rather than throwing an exception we check if the new size is
> > greater than the old size and if so we grow the buffer? I can code this
> > up.
>
> So yes, I think we should implement it this way.

OK I will do this.  Actually, I think we should also allow making the
buffer smaller this way.  Meaning, I will preserve buffer contents
(starting from bufferPosition) as much as is allowed by the smaller
buffer.  This way there is no restriction on using this method
vs. having read bytes already ("principle of least surprise").


> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498780 ]

Michael Busch commented on LUCENE-888:
--------------------------------------

> OK I will do this. Actually, I think we should also allow making the
> buffer smaller this way. Meaning, I will preserve buffer contents
> (starting from bufferPosition) as much as is allowed by the smaller
> buffer. This way there is no restriction on using this method
> vs. having read bytes already ("principle of least surprise").

Yep sounds good. I can code this and commit it with LUCENE-430.
If you want to do it, then I should commit the existing 430 patch
soon, so that there are no conflicts in our patches?

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498784 ]

Michael McCandless commented on LUCENE-888:
-------------------------------------------

> > OK I will do this. Actually, I think we should also allow making the
> > buffer smaller this way. Meaning, I will preserve buffer contents
> > (starting from bufferPosition) as much as is allowed by the smaller
> > buffer. This way there is no restriction on using this method
> > vs. having read bytes already ("principle of least surprise").
>
> Yep sounds good. I can code this and commit it with LUCENE-430.
> If you want to do it, then I should commit the existing 430 patch
> soon, so that there are no conflicts in our patches?

I'm actually coding it up now.  Why don't you commit LUCENE-430
soonish and then I'll update & merge?


> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498787 ]

Marvin Humphrey commented on LUCENE-888:
----------------------------------------

I would like to know why these gains are appearing, and how specific they are to a particular system.  How can the optimum buffer size be deduced?  Is it a factor of hard disk sector size?  Memory page size?  Lucene write behavior pattern? Level X Cache size?

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498793 ]

Michael Busch commented on LUCENE-888:
--------------------------------------

> I'm actually coding it up now. Why don't you commit LUCENE-430
> soonish and then I'll update & merge?

Done.

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499018 ]

Michael McCandless commented on LUCENE-888:
-------------------------------------------

OK I ran two sets of tests.  First is only on Mac OS X to see how
performance changes with buffer sizes.  Second was also on Debian
Linux & Windows XP Pro.

The performance gains are 10-18% faster overall.


FIRST TEST

I increased buffer sizes, separately, for each of BufferedIndexInput,
BufferedIndexOutput and CompoundFileWriter.  Each test is run once on
Mac OS X:

  BufferedIndexInput

      1 K   622 sec (current trunk)
      4 K   607 sec
      8 K   606 sec
     16 K   598 sec
     32 K   606 sec
     64 K   589 sec
    128 K   601 sec

  CompoundFileWriter

      1 K   622 sec (current trunk)
      4 K   599 sec
      8 K   591 sec
     16 K   578 sec
     32 K   583 sec
     64 K   580 sec

  BufferedIndexOutput

      1 K   622 sec (current trunk)
      4 K   588 sec
      8 K   576 sec
     16 K   551 sec
     32 K   566 sec
     64 K   555 sec
    128 K   543 sec
    256 K   534 sec
    512 K   564 sec

Comments:

  * The results are fairly noisy, but, performance does generally get
    better w/ larger buffers.

  * BufferedIndexOutput seems specifically to like very large output
    buffers; the other two seem to have less but still significant
    effect.

Given this I picked 16 K buffer for BufferedIndexOutput, 16 K buffer
for CompoundFileWriter and 4 K buffer for BufferedIndexInput. I think
we would get faster performance for a larger buffer for
BufferedIndexInput, but, even when merging there are quite a few of
these created (mergeFactor * N where N = number of separate index
files).

Then, I re-tested the baseline (trunk) & these buffer sizes across
platforms (below):



SECOND TEST

Baseline (trunk) = 1 K buffers for all 3.  New = 16 K for
BufferedIndexOutput, 16 K for CompoundFileWriter and 4 K for
BufferedIndexInput.

I ran each test 4 times & took the best time:

Quad core Mac OS X on 4-drive RAID 0
  baseline  622 sec
  new       527 sec
  -> 15% faster

Dual core Debian Linux (2.6.18 kernel) on 6 drive RAID 5
  baseline  708 sec
  new       635 sec
  -> 10% faster
 
Windows XP Pro laptop, single drive
  baseline  1604 sec
  new       1308 sec
  -> 18% faster

Net/net it's between 10-18% performance gain overall.  It is
interesting that the system with the "weakest" IO system (one drive on
Windows XP vs RAID 0/5 on the others) has the best gains.

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499020 ]

Michael McCandless commented on LUCENE-888:
-------------------------------------------


> I would like to know why these gains are appearing, and how specific
> they are to a particular system. How can the optimum buffer size be
> deduced? Is it a factor of hard disk sector size? Memory page size?
> Lucene write behavior pattern? Level X Cache size?

It looks like the gains are cross platform (at least between OS X,
Linux, Windows XP) and cross-IO architecture.

I'm not sure how this depends/correlates to the various cache/page
sizes through the layers of OS -> disk heads.

It must be that doing an IO request has a fairly high overhead and so
the more bytes you can read/write at once the faster it is, since you
amortize that overhead.

For merging in particular, with mergeFactor=10, I can see that a
larger buffer size on the input streams should help reduce insane
seeks back & forth between the 10 files (and the 1 output file).
Maybe larger reads on the input streams also cause OS's IO scheduler
to do larger read-ahead in anticipation?

And some good news: these gains seem to be additive to the gains in
LUCENE-843, at least with my initial testing.


> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499044 ]

John Haxby commented on LUCENE-888:
-----------------------------------

> Net/net it's between 10-18% performance gain overall. It is
> interesting that the system with the "weakest" IO system (one drive on
> Windows XP vs RAID 0/5 on the others) has the best gains.

Actually, it's not that surprising.  Linux and BSD (MacOS) kernels work hard to do good I/O without the user having to do that much to take it into account.   The improvement you're seeing in those systems is as much to do with the fact that you're dealing with complete file system block sizes (4x4k) and complete VM page sizes (4x4k).   You'd probably see similar gains just going from 1k to 4k though: even "cp" benefits from using a 4k block size rather than 1k.  I'd guess that a 4k or 8k buffer would be best on Linux/MacOS and that you wouldn't see much difference going to 16k.  In fact, in the MacOS tests the big jump seems to be from 1k to 4k with smaller improvements thereafer.

I'm not that surprised by the WinXP changes: the I/O subsystem on a laptop is usually dire and anything that will cut down on the I/O is going to be a big help.  I would expect that the difference would be more dramatic with a FAT32 file system than it would be with NTFS though.

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-888:
--------------------------------------

    Attachment: LUCENE-888.patch

Attached the patch based on above discussion.

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-888.patch
>
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499182 ]

Michael Busch commented on LUCENE-888:
--------------------------------------

Mike,

I tested and reviewed your patch. It looks good and all tests pass! Two comments:

- Should we increase the buffer size for CompoundFileReader to 4KB not only for the merge mode but also for the normal read mode?
- In BufferedIndexInput.setBufferSize() a new buffer should only be allocated if the new size is different from the previous buffer size.

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-888.patch
>
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499199 ]

Michael McCandless commented on LUCENE-888:
-------------------------------------------

> I tested and reviewed your patch. It looks good and all tests pass!

Thanks!

> - Should we increase the buffer size for CompoundFileReader to 4KB
> not only for the merge mode but also for the normal read mode?

I'm a little nervous about that: I don't know the impact it will have
on searching, especially queries that heavily use skipping?

Hmmm, actually, a CSIndexInput potentially goes through 2 buffers when
it does a read -- its own (since each CSIndexInput subclasses from
BufferedIndexInput) and then the main stream of the
CompoundFileReader.  It seems like we shouldn't do this?  We should
not do a double copy.

It almost seems like the double copy would not occur becaase
readBytes() has logic to read directly from the underlying stream if
the sizes is >= bufferSize.  However, I see at least one case during
merging where this logic doesn't kick in (and we do a double buffer
copy) because the buffers become "skewed".  I will open a separate
issue for this; I think we should fix it and gain some more
performance.

> In BufferedIndexInput.setBufferSize() a new buffer should only be
> allocated if the new size is different from the previous buffer
> size.

Ahh, good.  Will do.

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-888.patch
>
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499206 ]

Michael Busch commented on LUCENE-888:
--------------------------------------

> I'm a little nervous about that: I don't know the impact it will have
> on searching, especially queries that heavily use skipping?

Doesn't the OS always read at least a whole block from the disk (usually
4 KB)? If the answer is yes, then we don't safe IO by limiting the buffer
size to 1 KB, right? Of course it makes sense to limit the size for
streams that we clone often (like the freq stream) to safe memory and
array copies. But we never clone the base stream in the cfs reader.

> Hmmm, actually, a CSIndexInput potentially goes through 2 buffers when
> it does a read -- its own (since each CSIndexInput subclasses from
> BufferedIndexInput) and then the main stream of the
> CompoundFileReader. It seems like we shouldn't do this? We should
> not do a double copy.
>
> It almost seems like the double copy would not occur becaase
> readBytes() has logic to read directly from the underlying stream if
> the sizes is >= bufferSize. However, I see at least one case during
> merging where this logic doesn't kick in (and we do a double buffer
> copy) because the buffers become "skewed". I will open a separate
> issue for this; I think we should fix it and gain some more
> performance.

Good catch! Reminds me a bit of LUCENE-431 where we also did double
buffering for the RAMInputStream and RAMOutputStream. Yes, I think
we should fix this.


> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-888.patch
>
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499209 ]

Michael McCandless commented on LUCENE-888:
-------------------------------------------

> > I'm a little nervous about that: I don't know the impact it will have
> > on searching, especially queries that heavily use skipping?
>
> Doesn't the OS always read at least a whole block from the disk
> (usually 4 KB)? If the answer is yes, then we don't safe IO by
> limiting the buffer size to 1 KB, right? Of course it makes sense to
> limit the size for streams that we clone often (like the freq
> stream) to safe memory and array copies. But we never clone the base
> stream in the cfs reader.

Yes, I think you're right.  But we should test search
performance on a large index & "typical" queries to be sure?  I will
open a new issue to track this...

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-888.patch
>
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-888) Improve indexing performance by increasing internal buffer sizes

Nick Burch (Jira)
In reply to this post by Nick Burch (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499212 ]

Doug Cutting commented on LUCENE-888:
-------------------------------------

> then we don't save IO by limiting the buffer size to 1 KB

I'm confused by this.  My assumption is that, when you make a request to read 1k from a disk file, that the OS reads substantially more than 1k from the disk and places it in the buffer cache.  (The cost of randomly reading 1k is nearly the same as randomly reading 100k--both are dominated by seek.) So, if you make another request to read 1k shortly thereafter you'll get it from the buffer cache and the incremental cost will be that of making a system call.

In general, one should thus rely on the buffer cache and read-ahead, and make input buffers only big enough so that system call overhead is insignificant.  An alternate strategy is to not trust the buffer cache and read-ahead, but rather to make your buffers large enough so that transfer time dominates seeks.  This can require 1MB or larger buffers, so isn't always practical.

So, back to your statement, a 1k buffer doesn't save physical i/o, but nor should it incur extra physical i/o.  It does incur extra system calls, but uses less memory, which is a tradeoff.  Is that what you meant?

> Improve indexing performance by increasing internal buffer sizes
> ----------------------------------------------------------------
>
>                 Key: LUCENE-888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-888.patch
>
>
> In working on LUCENE-843, I noticed that two buffer sizes have a
> substantial impact on overall indexing performance.
> First is BufferedIndexOutput.BUFFER_SIZE (also used by
> BufferedIndexInput).  Second is CompoundFileWriter's buffer used to
> actually build the compound file.  Both are now 1 KB (1024 bytes).
> I ran the same indexing test I'm using for LUCENE-843.  I'm indexing
> ~5,500 byte plain text docs derived from the Europarl corpus
> (English).  I index 200,000 docs with compound file enabled and term
> vector positions & offsets stored plus stored fields.  I flush
> documents at 16 MB RAM usage, and I set maxBufferedDocs carefully to
> not hit LUCENE-845.  The resulting index is 1.7 GB.  The index is not
> optimized in the end and I left mergeFactor @ 10.
> I ran the tests on a quad-core OS X 10 machine with 4-drive RAID 0 IO
> system.
> At 1 KB (current Lucene trunk) it takes 622 sec to build the index; if
> I increase both buffers to 8 KB it takes 554 sec to build the index,
> which is an 11% overall gain!
> I will run more tests to see if there is a natural knee in the curve
> (buffer size above which we don't really gain much more performance).
> I'm guessing we should leave BufferedIndexInput's default BUFFER_SIZE
> at 1024, at least for now.  During searching there can be quite a few
> of this class instantiated, and likely a larger buffer size for the
> freq/prox streams could actually hurt search performance for those
> searches that use skipping.
> The CompoundFileWriter buffer is created only briefly, so I think we
> can use a fairly large (32 KB?) buffer there.  And there should not be
> too many BufferedIndexOutputs alive at once so I think a large-ish
> buffer (16 KB?) should be OK.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

12