[jira] Created: (TIKA-211) memory issue in ExcelExtractor

JIRA jira@apache.org
memory issue in ExcelExtractor
------------------------------

                 Key: TIKA-211
                 URL: https://issues.apache.org/jira/browse/TIKA-211
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.3
            Reporter: Daan de Wit


The Excel extractor consumes a very large amount of memory when given an Excel file containing many numeric cells. I tested with a simple sheet of 254 columns and 5511 rows, which gives an 8MB file; extraction failed with an OutOfMemoryError (OOME) when given 512MB.
The memory issue is caused by the Java NumberFormat that is instantiated for every numeric cell. A solution would be to cache the NumberFormat instance in the TikaHSSFListener class. Since NumberFormat is not thread-safe, it might be necessary to pool it.
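
A minimal sketch of the caching idea, not the actual Tika code: the class name CachedCellFormatter, the use of Locale.US, and the choice of a ThreadLocal as a simple stand-in for pooling are all assumptions made for illustration.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical helper (not part of Tika): reuses one NumberFormat per
// thread instead of creating a new one for every numeric cell.
class CachedCellFormatter {
    // NumberFormat is not thread-safe, so each thread gets its own instance.
    // Locale.US is an assumption to keep the output predictable.
    private static final ThreadLocal<NumberFormat> FORMAT =
            ThreadLocal.withInitial(() -> NumberFormat.getInstance(Locale.US));

    // Formats one numeric cell value with the cached per-thread instance.
    static String format(double value) {
        return FORMAT.get().format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.5));
    }
}
```

With this shape the number of NumberFormat instances depends on the number of threads, not on the number of cells in the sheet.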

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (TIKA-211) memory issue in ExcelExtractor

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/TIKA-211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-211.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.4
         Assignee: Jukka Zitting

Thanks! Fixed in revision 757719.

PS. We don't need to worry about thread-safety as long as the NumberFormat instances are local to the parse() method, which is how I implemented this for now.
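
For illustration only, a rough sketch of what "local to the parse() method" can look like. This is not the code from revision 757719; the class LocalFormatParser, the array-of-doubles input, and Locale.US are hypothetical simplifications.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical stand-in for a parser; not the actual Tika implementation.
class LocalFormatParser {
    // One NumberFormat is created per parse() call and never leaves the
    // method, so concurrent parse() calls cannot share it.
    static String parse(double[] numericCells) {
        NumberFormat format = NumberFormat.getInstance(Locale.US);
        StringBuilder text = new StringBuilder();
        for (double cell : numericCells) {
            if (text.length() > 0) {
                text.append('\t'); // separate cells with tabs
            }
            text.append(format.format(cell));
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(parse(new double[] {1.5, 2000, 0.25}));
    }
}
```

The instance count here is one per document parsed, which avoids the per-cell allocation without needing a pool or any synchronization.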
