Large xls files always loaded into memory?

Mark Barton2
From the Tika documentation (lucene.apache.org/tika/documentation.html), I read that Tika uses streamed parsing, and that "This allows even huge documents to be parsed without excessive resource requirements."  But it seems that my large xls file (240 megs) is being pulled completely into RAM, and the process crashes when the heap fills up.  The Tika class OfficeParser uses org.apache.poi.poifs.filesystem.POIFSFileSystem, and in the debugger I see the following line (source comment included) being executed in that class:

     // read the rest of the stream into blocks
     data_blocks = new RawDataBlockList(stream, bigBlockSize);

It does indeed seem to be trying to read the entire 240 megs into blocks.  Am I missing something?  My main motivation for using Tika is that it seemed to offer a way to process large xls files without pulling them into memory.
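
For reference, here is roughly what my invocation boils down to (a stripped-down sketch rather than my exact code; the file name is made up):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.microsoft.OfficeParser;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.helpers.DefaultHandler;

    public class LargeXlsMemoryTest {
        public static void main(String[] args) throws Exception {
            // No-op SAX handler: the extracted text doesn't matter here,
            // only the memory behaviour while parsing
            ContentHandler handler = new DefaultHandler();
            Metadata metadata = new Metadata();

            InputStream stream = new FileInputStream("big-spreadsheet.xls");
            try {
                // OfficeParser hands the stream to POIFSFileSystem, and this
                // call is where the whole file ends up in the heap
                new OfficeParser().parse(stream, handler, metadata);
            } finally {
                stream.close();
            }
        }
    }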

Thanks for any insights you can offer.

Re: Large xls files always loaded into memory?

Jukka Zitting
Hi,

Sorry for the late reply.

On Thu, Apr 16, 2009 at 11:38 PM, Mark Barton2 <[hidden email]> wrote:
> From the Tika documentation (lucene.apache.org/tika/documentation.html), I
> read that Tika uses streamed parsing, and that "This allows even huge
> documents to be parsed without excessive resource requirements."

Yes, that's one of the key design criteria for the Tika Parser interface.

However, not all of the parser implementations are yet fully compliant
with this design goal.

> But it seems that my large xls file (240 megs) is being pulled completely into
> RAM, and the process crashes when the heap fills up.  The Tika class OfficeParser uses
> org.apache.poi.poifs.filesystem.POIFSFileSystem, and in the debugger I see the
> following line (source comment included) being executed in that class:
>
>     // read the rest of the stream into blocks
>     data_blocks = new RawDataBlockList(stream, bigBlockSize);
>
> It does indeed seem to be trying to read the entire 240 megs into blocks.

Yeah, that seems unfortunate. I'm not too familiar with the POI internals, but
I was always under the impression that it would just keep a list of
data block _references_ in memory and would load the actual data only
when needed. Maybe I'm mistaken.

Anyway, it would be good to contact the POI project for more input on
this. We're already using the HSSF Event API, which is designed for
streaming, but perhaps there are some extra options that we should be
using.
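
For what it's worth, the event-style usage looks roughly like the sketch below (a minimal standalone example, not the exact code in Tika). Records are delivered to the listener one at a time instead of building a full workbook model, but the POIFSFileSystem construction at the top is still where the whole file gets read:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.poi.hssf.eventusermodel.HSSFEventFactory;
    import org.apache.poi.hssf.eventusermodel.HSSFListener;
    import org.apache.poi.hssf.eventusermodel.HSSFRequest;
    import org.apache.poi.hssf.record.NumberRecord;
    import org.apache.poi.hssf.record.Record;
    import org.apache.poi.poifs.filesystem.POIFSFileSystem;

    public class HssfEventSketch {
        public static void main(String[] args) throws Exception {
            InputStream in = new FileInputStream("big-spreadsheet.xls");
            try {
                // Still the expensive step: the OLE2 container is read into
                // raw data blocks before any records reach the listener
                POIFSFileSystem fs = new POIFSFileSystem(in);

                HSSFRequest request = new HSSFRequest();
                request.addListenerForAllRecords(new HSSFListener() {
                    public void processRecord(Record record) {
                        // Each record is handled as it streams past
                        if (record instanceof NumberRecord) {
                            NumberRecord cell = (NumberRecord) record;
                            System.out.println(cell.getRow() + "," + cell.getColumn()
                                    + " = " + cell.getValue());
                        }
                    }
                });

                new HSSFEventFactory().processWorkbookEvents(request, fs);
            } finally {
                in.close();
            }
        }
    }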

Or perhaps we simply need to fix this in POI. The "What's Next?" section
in [1] mentions performance ("POI currently uses a lot of memory for
large sheets") as an area of future improvement.

[1] http://poi.apache.org/spreadsheet/how-to.html

BR,

Jukka Zitting