Parser.parse with file instead of stream

Stefano Fornari
Hi All,
I am using Lucene in an embedded environment and I need to keep memory use
under control. While investigating a problem with big PDF files (a few
MB), I noticed that Parser.parse takes an InputStream as a parameter, but
PDFParser then has the following code:

    TikaInputStream tstream = TikaInputStream.cast(stream);
    if (tstream != null && tstream.hasFile()) {
        // File based, take that as a cue to use a temporary file
        RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
        if (localConfig.getUseNonSequentialParser() == true){
            pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), scratchFile);
        } else {
            pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
        }
    } else {
        // Go for the normal, stream based in-memory parsing
        if (localConfig.getUseNonSequentialParser() == true){
            pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), new RandomAccessBuffer());
        } else {
            pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true);
        }
    }

I am not sure tstream.hasFile() can ever be true; from my understanding of
the code it can only be false. Therefore the "else" branch runs and the
stream is handled in memory. I suspect this means the stream (or a good
part of it) is read into memory somewhere along the way, potentially using
a lot of memory.

I then tried a different approach, adding a version of parse() that
accepts a file instead of a stream. The code above then becomes:

    TikaInputStream tstream = TikaInputStream.get(file);
    if (tstream != null && tstream.hasFile()) {
        // File based, take that as a cue to use a temporary file
        RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
        if (localConfig.getUseNonSequentialParser() == true){
            pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(tstream), scratchFile);
        } else {
            pdfDocument = PDDocument.load(new CloseShieldInputStream(tstream), scratchFile, true);
        }
    } else {
        // Go for the normal, stream based in-memory parsing
        if (localConfig.getUseNonSequentialParser() == true){
            pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(tstream), new RandomAccessBuffer());
        } else {
            pdfDocument = PDDocument.load(new CloseShieldInputStream(tstream), true);
        }
    }

(but do we really need the && in the if?)

This is much friendlier on memory: with the first version of the method I
could not parse a 4.3 MB file when running the JVM with 16 MB, while the
second approach parsed it successfully.
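
To illustrate the general idea behind a scratch file (this is only a sketch, not how PDFBox actually uses its scratch file, which is not shown in this thread): a stream can be copied to a temporary file through a small fixed buffer and then read back with random access, so the heap only ever holds the buffer. The class name ScratchFileSketch and the helper spool() are hypothetical, and the in-memory byte array is just stand-in input.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;

public class ScratchFileSketch {

    /** Copies the stream to the scratch file through a small fixed buffer; returns bytes copied. */
    static long spool(InputStream in, RandomAccessFile scratch) throws IOException {
        byte[] buffer = new byte[8192]; // the only heap the copy itself needs
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            scratch.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in input; a real caller would pass a file or network stream.
        byte[] data = new byte[100_000];
        data[99_999] = 42;

        File tmp = File.createTempFile("scratch", ".bin");
        tmp.deleteOnExit();
        try (RandomAccessFile scratch = new RandomAccessFile(tmp, "rw")) {
            long copied = spool(new ByteArrayInputStream(data), scratch);
            // Random access: jump straight to the last byte without re-reading the rest.
            scratch.seek(copied - 1);
            System.out.println(copied + " bytes spooled, last byte = " + scratch.read());
        }
    }
}
```

By contrast, an in-memory buffer has to hold the whole document on the heap, which fits the failure seen with the small JVM.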

What do you think about extending the Parser interface accordingly? Would
you be interested in a patch that does it?

Ste

Re: Parser.parse with file instead of stream

Jukka Zitting
Hi,

On Thu, Mar 27, 2014 at 6:07 PM, Stefano Fornari
<[hidden email]> wrote:
> I am not sure tstream.hasFile() can ever be true, from my understanding of
> the code it can be only false.

It's true if you call the parser like this:

    InputStream stream = TikaInputStream.get(file);
    try {
        parser.parse(stream, ...);
    } finally {
        stream.close();
    }
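
The mechanism can be sketched without Tika on the classpath. The class below is a simplified, hypothetical stand-in for TikaInputStream (the names FileAwareStream and chooseStrategy are invented for illustration and are not Tika API). It assumes cast() returns null for a stream that is not file-aware, which is what the null check in the PDFParser code suggests.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

public class FileAwareStreamSketch {

    /** Hypothetical stand-in for TikaInputStream: a stream that may remember its backing file. */
    static final class FileAwareStream extends FilterInputStream {
        private final File file; // null when the data did not come from a file

        private FileAwareStream(InputStream in, File file) {
            super(in);
            this.file = file;
        }

        static FileAwareStream get(File file) throws IOException {
            return new FileAwareStream(new FileInputStream(file), file);
        }

        /** Returns the stream itself if it is file-aware, otherwise null. */
        static FileAwareStream cast(InputStream stream) {
            return (stream instanceof FileAwareStream) ? (FileAwareStream) stream : null;
        }

        boolean hasFile() {
            return file != null;
        }
    }

    /** Hypothetical parser entry point: picks a strategy from the runtime type of the stream. */
    static String chooseStrategy(InputStream stream) {
        FileAwareStream fstream = FileAwareStream.cast(stream);
        if (fstream != null && fstream.hasFile()) {
            return "scratch-file";
        }
        return "in-memory";
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("sketch", ".bin");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), new byte[] {1, 2, 3});

        // Same file both times; only the wrapper type passed to the parser differs.
        try (InputStream plain = new FileInputStream(tmp);
             InputStream aware = FileAwareStream.get(tmp)) {
            System.out.println(chooseStrategy(plain)); // in-memory
            System.out.println(chooseStrategy(aware)); // scratch-file
        }
    }
}
```

In this model the parse method keeps its InputStream signature, and the caller opts into file-based handling simply by choosing which wrapper to pass. The null check is what covers the plain-stream case.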

> What do you think about extending the Parse interface accordingly?

See https://issues.apache.org/jira/browse/TIKA-153 (and the
TikaInputStream javadocs) for details on how we already achieve this
functionality.

BR,

Jukka Zitting

Re: Parser.parse with file instead of stream

Stefano Fornari
That worked! Thanks.

Ste


On Thu, Mar 27, 2014 at 11:24 PM, Jukka Zitting <[hidden email]> wrote:

> Hi,
>
> On Thu, Mar 27, 2014 at 6:07 PM, Stefano Fornari
> <[hidden email]> wrote:
> > I am not sure tstream.hasFile() can ever be true, from my understanding of
> > the code it can be only false.
>
> It's true if you call the parser like this:
>
>     InputStream stream = TikaInputStream.get(file);
>     try {
>         parser.parse(stream, ...);
>     } finally {
>         stream.close();
>     }
>
> > What do you think about extending the Parse interface accordingly?
>
> See https://issues.apache.org/jira/browse/TIKA-153 (and the
> TikaInputStream javadocs) for details on how we already achieve this
> functionality.
>
> BR,
>
> Jukka Zitting
>