[jira] Created: (TIKA-45) RereadableInputStream needs to be able to read to the end of the original stream on first rewind.


Jorge Spinsanti (Jira)
RereadableInputStream needs to be able to read to the end of the original stream on first rewind.
-------------------------------------------------------------------------------------------------

                 Key: TIKA-45
                 URL: https://issues.apache.org/jira/browse/TIKA-45
             Project: Tika
          Issue Type: Improvement
          Components: general
    Affects Versions: 0.1-incubator
            Reporter: Keith R. Bennett
             Fix For: 0.1-incubator


RereadableInputStream reads a stream's content into a store (memory or file) on its first pass. If rewind() is called before the end of the stream is reached, the bytes not yet read are never stored, so they are unavailable on subsequent reads of the RereadableInputStream. This could be a problem if, for example, a parser reads metadata from the beginning of a stream and then calls rewind(), expecting to get the entire document.
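
The failure mode described above can be reproduced with a small stand-in class. This is only a sketch under assumptions: NaiveRereadableStream is a hypothetical name, it uses an in-memory store only, and it is not the Tika source. It shows that bytes left unread before rewind() are gone afterwards.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical, simplified stand-in for the behavior reported here; not the Tika class. */
class NaiveRereadableStream extends InputStream {
    private final InputStream original;
    private final ByteArrayOutputStream store = new ByteArrayOutputStream();
    private InputStream current;        // where read() currently pulls bytes from
    private boolean firstPass = true;

    NaiveRereadableStream(InputStream original) {
        this.original = original;
        this.current = original;
    }

    @Override
    public int read() throws IOException {
        int b = current.read();
        if (firstPass && b != -1) {
            store.write(b);             // only bytes actually read get saved
        }
        return b;
    }

    /** Switches to the store without draining the original stream first. */
    void rewind() throws IOException {
        if (firstPass) {
            original.close();
            firstPass = false;
        }
        current = new ByteArrayInputStream(store.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        byte[] document = "0123456789".getBytes(StandardCharsets.US_ASCII);
        NaiveRereadableStream in =
                new NaiveRereadableStream(new ByteArrayInputStream(document));
        in.read();
        in.read();                      // e.g. a parser sniffing a 2-byte header
        in.rewind();
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        System.out.println("bytes after rewind: " + count); // prints 2, not 10
    }
}
```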

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-45) RereadableInputStream needs to be able to read to the end of the original stream on first rewind.

Jorge Spinsanti (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith R. Bennett updated TIKA-45:
---------------------------------

    Attachment: RereadableInputStreamTest.java
                RereadableInputStream.java
                tika45.patch

I've attached both a patch and the patched source files, the latter for convenience in viewing.

Changes to the RereadableInputStream include:

* Addressed this issue: by default, the stream now reads to the end of the original input stream on the first rewind. A constructor taking a boolean lets the caller specify whether or not to do this.

* Added javadoc.
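
For readers without the attachments, the flag-based change might look roughly like the following. This is a hedged sketch, not the attached RereadableInputStream.java: the class name DrainingRereadableStream, the parameter name readToEndOnFirstRewind, and the in-memory-only store are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of the flag-based fix; in-memory store only. */
class DrainingRereadableStream extends InputStream {
    private final InputStream original;
    private final boolean readToEndOnFirstRewind;   // assumed parameter name
    private final ByteArrayOutputStream store = new ByteArrayOutputStream();
    private InputStream current;
    private boolean firstPass = true;

    /** Default: drain the original stream on the first rewind. */
    DrainingRereadableStream(InputStream original) {
        this(original, true);
    }

    DrainingRereadableStream(InputStream original, boolean readToEndOnFirstRewind) {
        this.original = original;
        this.current = original;
        this.readToEndOnFirstRewind = readToEndOnFirstRewind;
    }

    @Override
    public int read() throws IOException {
        int b = current.read();
        if (firstPass && b != -1) {
            store.write(b);
        }
        return b;
    }

    void rewind() throws IOException {
        if (firstPass) {
            if (readToEndOnFirstRewind) {
                // Pull the unread remainder into the store before switching over.
                while (read() != -1) {
                    // read() saves each byte as a side effect
                }
            }
            original.close();
            firstPass = false;
        }
        current = new ByteArrayInputStream(store.toByteArray());
    }

    /** Reads two bytes, rewinds, and counts what is available afterwards. */
    static int bytesAfterEarlyRewind(boolean readToEnd) throws IOException {
        byte[] document = "0123456789".getBytes(StandardCharsets.US_ASCII);
        DrainingRereadableStream in =
                new DrainingRereadableStream(new ByteArrayInputStream(document), readToEnd);
        in.read();
        in.read();
        in.rewind();
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("flag=true:  " + bytesAfterEarlyRewind(true));   // 10
        System.out.println("flag=false: " + bytesAfterEarlyRewind(false));  // 2
    }
}
```

The cost of the default is that the whole original stream is read and stored on the first rewind, even if the caller never needs the rest; passing false keeps the old behavior for callers who know they have read everything they need.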

Thanks to Chris Mattmann for his suggestion regarding this issue.

As you can see, this class has a unit test, but given its importance, more testing would be a Good Thing.

I'm pasting here a TODO comment from the file because it describes what I think is a better solution to the problem:

    // TODO: At some point it would be better to replace the current approach
    // (specifying the above) with more automated behavior.  The stream could
    // keep the original stream open until EOF was reached.  For example, if:
    //
    // the original stream is 10 bytes, and
    // only 2 bytes are read on the first pass
    // rewind() is called
    // 5 bytes are read
    //
    // In this case, this instance gets the first 2 from its store,
    // and the next 3 from the original stream, saving those additional 3
    // bytes in the store.  In this way, only the maximum number of bytes
    // ever needed must be saved in the store; unused bytes are never read.
    // The original stream is closed when EOF is reached, or when close()
    // is called, whichever comes first.  Using this approach eliminates
    // the need to specify the flag (though makes implementation more complex).
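
The approach in the TODO could be sketched as below. This is one possible reading of the comment, not code from the patch: LazyRereadableStream and storedByteCount() are hypothetical names, and a growable byte array stands in for the memory-or-file store.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch of the TODO: keep the original stream open until EOF or close(). */
class LazyRereadableStream extends InputStream {
    private final InputStream original;
    private byte[] store = new byte[16];   // stand-in for the memory/file store
    private int stored = 0;                // number of bytes saved so far
    private int position = 0;              // index of the next byte to return
    private boolean originalOpen = true;

    LazyRereadableStream(InputStream original) {
        this.original = original;
    }

    @Override
    public int read() throws IOException {
        if (position < stored) {
            return store[position++] & 0xFF;   // replay a byte saved earlier
        }
        if (!originalOpen) {
            return -1;                          // nothing more to fetch
        }
        int b = original.read();
        if (b == -1) {
            closeOriginal();                    // EOF reached on the original
            return -1;
        }
        if (stored == store.length) {
            store = Arrays.copyOf(store, store.length * 2);
        }
        store[stored++] = (byte) b;             // save only what was actually read
        position++;
        return b;
    }

    /** Rewinding just resets the position; the original stream stays open. */
    void rewind() {
        position = 0;
    }

    int storedByteCount() {
        return stored;
    }

    @Override
    public void close() throws IOException {
        closeOriginal();
    }

    private void closeOriginal() throws IOException {
        if (originalOpen) {
            original.close();
            originalOpen = false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] document = "0123456789".getBytes(StandardCharsets.US_ASCII);
        LazyRereadableStream in = new LazyRereadableStream(new ByteArrayInputStream(document));
        in.read();
        in.read();                  // first pass: 2 bytes
        in.rewind();
        for (int i = 0; i < 5; i++) {
            in.read();              // 2 from the store, 3 from the original
        }
        System.out.println("bytes in store: " + in.storedByteCount()); // 5
        in.close();
    }
}
```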

- Keith




[jira] Resolved: (TIKA-45) RereadableInputStream needs to be able to read to the end of the original stream on first rewind.

Jorge Spinsanti (Jira)

     [ https://issues.apache.org/jira/browse/TIKA-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-45.
-------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

Committed patch in revision 582698.

