[jira] [Commented] (TIKA-2802) Out of memory issues when extracting large files (pst)

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737283#comment-16737283 ]

Tim Allison commented on TIKA-2802:

[INFO] +- org.apache.ctakes:ctakes-core:jar:4.0.0:provided
[INFO] |  +- org.apache.ctakes:ctakes-core-res:jar:4.0.0:provided
[INFO] |  +- xerces:xercesImpl:jar:2.11.0:provided

That would explain why I'm seeing xerces in my dev environment, and you're not seeing it when you pull it in.
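For anyone who wants to confirm which SAX implementation actually gets picked up at runtime, a minimal sketch follows. The class and method names (SaxImplCheck, describeSaxImpl, isBundledJdkImpl) are made up for illustration and are not part of Tika; the only assumption is the usual package naming, where the parser bundled with the JDK lives under com.sun.org.apache.xerces.internal and a standalone Xerces2 jar provides org.apache.xerces.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Tika: reports which JAXP SAX
// implementation the current classpath resolves to.
public class SaxImplCheck {

    // Returns "<factory class> -> <parser class>" for the default JAXP lookup.
    public static String describeSaxImpl() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser parser = factory.newSAXParser();
            return factory.getClass().getName() + " -> " + parser.getClass().getName();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Could not create a SAX parser", e);
        }
    }

    // True when the class name belongs to the Xerces copy bundled with the JDK,
    // false for a standalone Xerces2 jar (org.apache.xerces.*) or anything else.
    public static boolean isBundledJdkImpl(String className) {
        return className.startsWith("com.sun.org.apache.xerces.internal.");
    }

    public static void main(String[] args) {
        String factoryName = SAXParserFactory.newInstance().getClass().getName();
        System.out.println(describeSaxImpl());
        System.out.println("bundled JDK implementation: " + isBundledJdkImpl(factoryName));
    }
}
```

Running this once in the dev environment and once in the deployed application should show whether the two are resolving to different parsers.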

Given your findings, it makes sense to me to include xerces2. Fellow devs, any objections?

> Out of memory issues when extracting large files (pst)
> ------------------------------------------------------
>                 Key: TIKA-2802
>                 URL: https://issues.apache.org/jira/browse/TIKA-2802
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20, 1.19.1
>         Environment: Reproduced on Windows 2012 R2 and Ubuntu 18.04.
> Java: jdk1.8.0_151
>            Reporter: Caleb Ott
>            Priority: Critical
>         Attachments: Selection_111.png, Selection_117.png
> I have an application that extracts text from multiple files on a file share. I've been running into issues with the application running out of memory (~26g dedicated to the heap).
> In the heap dumps I found a "fDTDDecl" buffer that creates very large char arrays and never releases that memory. In the picture you can see the heap dump with 4 SAXParsers holding onto a large chunk of memory. The fourth one is expanded to show that it is all being held by the "fDTDDecl" field. This dump is from a scaled-down execution (not a 26g heap).
> It looks like that DTD field should never be that large, so I'm wondering whether this is a bug in xerces instead. I can easily reproduce the issue by attempting to extract text from large .pst files.

This message was sent by Atlassian JIRA