[jira] Created: (LUCENE-892) CompoundFileReader's openInput produces streams that may do an extra buffer copy

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (LUCENE-892) CompoundFileReader's openInput produces streams that may do an extra buffer copy

Tim Allison (Jira)
CompoundFileReader's openInput produces streams that may do an extra buffer copy
--------------------------------------------------------------------------------

                 Key: LUCENE-892
                 URL: https://issues.apache.org/jira/browse/LUCENE-892
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
    Affects Versions: 2.1
            Reporter: Michael McCandless
         Assigned To: Michael McCandless
            Priority: Minor


Spinoff of LUCENE-888.

The class for reading from a compound file (CompoundFileReader) has a
primary stream which is a BufferedIndexInput when that stream is from
an FSDirectory (which is the norm).  That is one layer of buffering.

Then, when its openInput is called, a CSIndexInput is created which
also subclasses from BufferedIndexInput.  That's a second layer of
buffering.

When a consumer actually uses that CSIndexInput to read, and a call to
readByte or readBytes runs out of what's in the first buffer, it will
go to refill its buffer.  But that refill calls the first
BufferedIndexInput which in turn may refill its buffer (a double
copy) by reading the underlying stream.

Not sure how to fix it yet but we should change things to not do the
extra buffer copy.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (LUCENE-892) CompoundFileReader's openInput produces streams that may do an extra buffer copy

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/LUCENE-892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507466 ]

Michael McCandless commented on LUCENE-892:
-------------------------------------------

It looks like the double copying only happens in certain limited cases
with Lucene now.

During indexing, when doing a segment merge of multiple CFS files,
when the CSIndexInput reaches the end of its sub-file then that final
buffer fill will be a double copy.

If the flushing is small (like the default every 10 docs, with small
docs) then often files are so small that they are less than a buffer
size and all reads are double-copy.

During searching the same situation occurs (tail end of file that is
less than buffer size).

I had thought I saw a case before where the two buffers get "out of
sync" such that every buffer refill in CSIndexInput results in a
double copy even when you are not at the end of the file, but I can't
reproduce that case now.  This would have been much more serious.

So we are sort of getting "lucky" right now, but, if Lucene changes
how it uses CFS, it could get the buffers to be "mis-aligned" at which
point every buffer fill would be a double copy.  So I think we should
still fix it.

The solution I'm leaning towards is to add a new readBytes() method in
IndexInput that takes 4th argument "boolean useBuffer".  It would have
a default implementation that just ignores that parameter (and calls
the current readBytes).  BufferedIndexInput would implement that
method and not refill its buffer if that parameter is false, and
CSIndexInput would pass false when it calls base.readBytes in its
readInternal method.


> CompoundFileReader's openInput produces streams that may do an extra buffer copy
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-892
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>
> Spinoff of LUCENE-888.
> The class for reading from a compound file (CompoundFileReader) has a
> primary stream which is a BufferedIndexInput when that stream is from
> an FSDirectory (which is the norm).  That is one layer of buffering.
> Then, when its openInput is called, a CSIndexInput is created which
> also subclasses from BufferedIndexInput.  That's a second layer of
> buffering.
> When a consumer actually uses that CSIndexInput to read, and a call to
> readByte or readBytes runs out of what's in the first buffer, it will
> go to refill its buffer.  But that refill calls the first
> BufferedIndexInput which in turn may refill its buffer (a double
> copy) by reading the underlying stream.
> Not sure how to fix it yet but we should change things to not do the
> extra buffer copy.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-892) CompoundFileReader's openInput produces streams that may do an extra buffer copy

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-892:
--------------------------------------

    Attachment: LUCENE-892.patch

Attached patch that implements the design above (make an
IndexInput.readBytes that let's you specify whether or not a buffer
should be used).  All tests pass.


> CompoundFileReader's openInput produces streams that may do an extra buffer copy
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-892
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-892.patch
>
>
> Spinoff of LUCENE-888.
> The class for reading from a compound file (CompoundFileReader) has a
> primary stream which is a BufferedIndexInput when that stream is from
> an FSDirectory (which is the norm).  That is one layer of buffering.
> Then, when its openInput is called, a CSIndexInput is created which
> also subclasses from BufferedIndexInput.  That's a second layer of
> buffering.
> When a consumer actually uses that CSIndexInput to read, and a call to
> readByte or readBytes runs out of what's in the first buffer, it will
> go to refill its buffer.  But that refill calls the first
> BufferedIndexInput which in turn may refill its buffer (a double
> copy) by reading the underlying stream.
> Not sure how to fix it yet but we should change things to not do the
> extra buffer copy.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (LUCENE-892) CompoundFileReader's openInput produces streams that may do an extra buffer copy

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-892:
--------------------------------------

    Lucene Fields: [New, Patch Available]  (was: [New])

> CompoundFileReader's openInput produces streams that may do an extra buffer copy
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-892
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-892.patch
>
>
> Spinoff of LUCENE-888.
> The class for reading from a compound file (CompoundFileReader) has a
> primary stream which is a BufferedIndexInput when that stream is from
> an FSDirectory (which is the norm).  That is one layer of buffering.
> Then, when its openInput is called, a CSIndexInput is created which
> also subclasses from BufferedIndexInput.  That's a second layer of
> buffering.
> When a consumer actually uses that CSIndexInput to read, and a call to
> readByte or readBytes runs out of what's in the first buffer, it will
> go to refill its buffer.  But that refill calls the first
> BufferedIndexInput which in turn may refill its buffer (a double
> copy) by reading the underlying stream.
> Not sure how to fix it yet but we should change things to not do the
> extra buffer copy.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

[jira] Resolved: (LUCENE-892) CompoundFileReader's openInput produces streams that may do an extra buffer copy

Tim Allison (Jira)
In reply to this post by Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/LUCENE-892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-892.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 2.3
    Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

> CompoundFileReader's openInput produces streams that may do an extra buffer copy
> --------------------------------------------------------------------------------
>
>                 Key: LUCENE-892
>                 URL: https://issues.apache.org/jira/browse/LUCENE-892
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: LUCENE-892.patch
>
>
> Spinoff of LUCENE-888.
> The class for reading from a compound file (CompoundFileReader) has a
> primary stream which is a BufferedIndexInput when that stream is from
> an FSDirectory (which is the norm).  That is one layer of buffering.
> Then, when its openInput is called, a CSIndexInput is created which
> also subclasses from BufferedIndexInput.  That's a second layer of
> buffering.
> When a consumer actually uses that CSIndexInput to read, and a call to
> readByte or readBytes runs out of what's in the first buffer, it will
> go to refill its buffer.  But that refill calls the first
> BufferedIndexInput which in turn may refill its buffer (a double
> copy) by reading the underlying stream.
> Not sure how to fix it yet but we should change things to not do the
> extra buffer copy.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]