I've been thinking about how our Parser interface takes an InputStream rather than a resource identifier (URL, File, String).
In order to accomplish the reading of an original resource only once, we have the RereadableInputStream. However, this presents the following potential problems due to the duplication of data in memory or on disk:
1) We are implementing the chunking of data using the SAX events. This allows us to break up a document into smaller parts. However, there is no such chunking with regard to the RereadableInputStream; it reads and stores the entire document.
2) Users need to be much more aware of their system's resources at all points in time during which Tika may be in use. This would require anticipating available disk storage, what other processes are running, etc.
3) In some environments, saving to disk is not practical due to performance or security concerns.
4) We introduce the risk of bringing down the JVM if the maximum memory is exceeded, and possibly worse if the disk runs out of free space.
5) The parser implementations themselves may store data and use large amounts of memory, so we may not have as much memory or disk available as we may think.
* * *
For casual uses, this will probably not be a problem. However, many users will need Tika to be robust and efficient even under high loads.
So I raise the question -- should we think about supporting multiple reads of a resource, at least as an option? Many users will work only with static resources such as files, and not be concerned about the data changing between reads. This would require changing the Parser interface, probably to take a URL rather than an InputStream.
Maybe this is not necessary though -- do we know to what extent parsers need to make multiple passes? And will they ever need the first pass to read more than just a small header? If not, then BufferedInputStream's mark and reset would work fine, and we would not need to store the read bytes ourselves, using RereadableInputStream or otherwise. I have no knowledge of the parser implementations, so I thought RereadableInputStream would cover the worst case. However, I'm now seeing that it presents problems of its own.
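To make the small-header case concrete, here is a minimal sketch of peeking at the first few bytes and then rewinding, using only java.io. HeaderPeek and peekHeader are hypothetical names for illustration, not part of Tika, and the sketch assumes the first pass never needs more than `limit` bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class HeaderPeek {
    /**
     * Hypothetical helper: reads up to `limit` bytes from the start of the
     * stream, then rewinds so the next reader sees the stream from the
     * beginning. Only `limit` bytes are buffered, never the whole document.
     */
    static byte[] peekHeader(InputStream in, int limit) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(limit); // the mark stays valid while at most `limit` bytes are read
        byte[] buf = new byte[limit];
        int total = 0;
        while (total < limit) {
            int n = in.read(buf, total, limit - total);
            if (n == -1) {
                break; // stream is shorter than the requested header
            }
            total += n;
        }
        in.reset(); // rewind to the marked position
        return Arrays.copyOf(buf, total);
    }
}
```

A caller would wrap the raw stream in a BufferedInputStream first, since many raw streams do not support mark/reset on their own.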
> ...In order to accomplish the reading of an original resource only once, we
> have the RereadableInputStream....
Is that used for all parsers?
If so, we should use it only where needed, by asking the parser if it
needs it or not. Or more precisely, if it needs a "no-rewind",
"small-rewind" or "all-rewind" input stream.
Then, we could document which parsers use which stream type, so that
people know which file types are likely to cause resource problems or
writes to disk.
I'd prefer this to be an internal concern of Tika, rather than putting
the burden on the user to decide if the input can be read several
times safely. Unless someone really needs that feature now, of course.
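
A rough sketch of what asking the parser could look like follows. All names here (RewindRequirement, RewindAwareParser, StreamPreparer, the 8 KB limit) are hypothetical and do not exist in Tika. The in-memory copy in the all-rewind case is only a stand-in for RereadableInputStream, whose API I am not assuming here.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical: how much of the input a parser may need to re-read. */
enum RewindRequirement { NO_REWIND, SMALL_REWIND, ALL_REWIND }

/** Hypothetical interface a parser could implement to declare its needs. */
interface RewindAwareParser {
    RewindRequirement rewindRequirement();
}

class StreamPreparer {
    /** Assumed upper bound for a "small" rewind, e.g. a file header. */
    static final int SMALL_REWIND_LIMIT = 8 * 1024;

    /** Wraps the stream only as much as the declared requirement demands. */
    static InputStream prepare(InputStream in, RewindRequirement need) throws IOException {
        switch (need) {
            case NO_REWIND:
                return in; // single pass: no extra buffering at all
            case SMALL_REWIND:
                // mark/reset works for up to SMALL_REWIND_LIMIT bytes
                return new BufferedInputStream(in, SMALL_REWIND_LIMIT);
            case ALL_REWIND:
                return copyToMemory(in); // stand-in for a fully rereadable stream
            default:
                throw new IllegalArgumentException("unknown requirement: " + need);
        }
    }

    /** Reads the whole stream into memory; reset() then returns to the start. */
    private static InputStream copyToMemory(InputStream in) throws IOException {
        ByteArrayOutputStream copy = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            copy.write(buf, 0, n);
        }
        return new ByteArrayInputStream(copy.toByteArray());
    }
}
```

With something like this, only parsers that declare ALL_REWIND would ever pay the memory or disk cost, and the documentation of stream types per parser falls out of the declarations.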
On 10/13/07, Bertrand Delacretaz <[hidden email]> wrote:
> On 10/13/07, Keith R. Bennett <[hidden email]> wrote:
> > ...In order to accomplish the reading of an original resource only once, we
> > have the RereadableInputStream....
> Is that used for all parsers?
I actually managed to get rid of the re-reading of the input stream in
the Microsoft parsers, see TIKA-63.
> I'd prefer this to be an internal concern of Tika, rather than putting
> the burden on the user to decide if the input can be read several
> times safely. Unless someone really needs that feature now, of course.
+1 The parser class should have the best knowledge on how many passes
will be required for reading the stream. Most of the time, I
guess a parser will either need just a single pass or will just read
the whole document into memory.