[jira] [Commented] (TIKA-2802) Out of memory issues when extracting large files (pst)

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738688#comment-16738688 ]

Caleb Ott commented on TIKA-2802:
---------------------------------

Tim, I have done more testing and it looks like the issue is fully resolved.

I think it makes sense to include xerces2 with Tika. If it is not included, developers can add the dependency themselves fairly easily, but some documentation showing how and why to use xerces2 would be helpful. The xerces that ships with Java seems buggy and outdated.
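
As a rough illustration of the "add the dependency themselves" point, and not an official Tika recipe: once the xerces2 jar (Maven coordinates `xerces:xercesImpl`) is on the classpath, JAXP's service lookup should normally select its factory in place of the JDK's internal copy. The sketch below only reports which SAXParserFactory implementation is active. The class name `SaxImplCheck` and the helper `describe` are hypothetical names invented for this example.

```java
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper class; not part of Tika or Xerces.
class SaxImplCheck {

    // Classify a factory class name by its package prefix.
    // The JDK's bundled copy lives under com.sun.org.apache.xerces.internal,
    // while the standalone xerces2 jar uses org.apache.xerces.
    static String describe(String factoryClassName) {
        if (factoryClassName.startsWith("com.sun.org.apache.xerces.internal.")) {
            return "JDK built-in xerces";
        }
        if (factoryClassName.startsWith("org.apache.xerces.")) {
            return "standalone xerces2";
        }
        return "other implementation";
    }

    public static void main(String[] args) {
        // newInstance() uses the standard JAXP lookup (system property,
        // service provider on the classpath, then the JDK default).
        SAXParserFactory factory = SAXParserFactory.newInstance();
        String name = factory.getClass().getName();
        System.out.println(name + " -> " + describe(name));
    }
}
```

Running this before and after adding the jar is one simple way to confirm that the application is actually picking up xerces2 rather than the JDK's parser.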

> Out of memory issues when extracting large files (pst)
> ------------------------------------------------------
>
>                 Key: TIKA-2802
>                 URL: https://issues.apache.org/jira/browse/TIKA-2802
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20, 1.19.1
>         Environment: Reproduced on Windows 2012 R2 and Ubuntu 18.04.
> Java: jdk1.8.0_151
>  
>            Reporter: Caleb Ott
>            Priority: Critical
>         Attachments: Selection_111.png, Selection_117.png
>
>
> I have an application that extracts text from multiple files on a file share. I've been running into issues with the application running out of memory (~26g dedicated to the heap).
> In the heap dumps I found a "fDTDDecl" buffer that creates very large char arrays and never releases that memory. In the picture you can see the heap dump with 4 SAXParsers holding onto a large chunk of memory. The fourth one is expanded to show that all of it is held by the "fDTDDecl" field. This dump is from a scaled-down execution (not a 26g heap).
> That DTD field should never be that large, so I'm wondering whether this is a bug in xerces instead. I can easily reproduce the issue by attempting to extract text from large .pst files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)