[jira] [Commented] (TIKA-2802) Out of memory issues when extracting large files (pst)


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16736351#comment-16736351 ]

Caleb Ott commented on TIKA-2802:

It looks like manually adding the xercesImpl dependency to my project resolved the issue!
// build.gradle dependencies
// https://mvnrepository.com/artifact/xerces/xercesImpl
compile group: 'xerces', name: 'xercesImpl', version: '2.12.0'
I didn't need to add the "-Djavax.xml.parsers..." command line arguments, and I also switched back to the 1.20 Tika release.
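For context, the JAXP system-property approach mentioned above typically looks something like the following (a sketch, not the exact flags from the earlier discussion; `my-app.jar` is a placeholder, and the property values assume xercesImpl is already on the classpath):

```shell
# Force the standalone Xerces implementation via JAXP lookup instead of
# relying on classpath discovery (sketch; my-app.jar is hypothetical):
java \
  -Djavax.xml.parsers.SAXParserFactory=org.apache.xerces.jaxp.SAXParserFactoryImpl \
  -Djavax.xml.parsers.DocumentBuilderFactory=org.apache.xerces.jaxp.DocumentBuilderFactoryImpl \
  -jar my-app.jar
```

Once the xercesImpl jar is on the classpath, JAXP's service lookup usually finds it without these flags, which matches the behavior described above.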

Before I added the xerces dependency, it was apparently using the Xerces version bundled with the JDK, which seems quite outdated. Should Tika add that dependency automatically, or are we expected to add it ourselves if we want to use Xerces2?
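One quick way to confirm which implementation is actually in use is to print the concrete factory class that JAXP resolves (a minimal sketch; the class names in the comments are what I would expect, not verified against a specific JDK):

```java
import javax.xml.parsers.SAXParserFactory;

public class WhichSaxParser {
    public static void main(String[] args) {
        // JAXP resolves the implementation from the javax.xml.parsers.SAXParserFactory
        // system property first, then the classpath (service lookup), then the JDK default.
        // With xercesImpl on the classpath this should report
        // org.apache.xerces.jaxp.SAXParserFactoryImpl; without it, the
        // JDK-internal copy (com.sun.org.apache.xerces.internal.jaxp.*).
        System.out.println(SAXParserFactory.newInstance().getClass().getName());
    }
}
```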

Note: I'll have to do some more in-depth testing to make sure the issue is fully resolved, but it fixed the test scenario I was using.

> Out of memory issues when extracting large files (pst)
> ------------------------------------------------------
>                 Key: TIKA-2802
>                 URL: https://issues.apache.org/jira/browse/TIKA-2802
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20, 1.19.1
>         Environment: Reproduced on Windows 2012 R2 and Ubuntu 18.04.
> Java: jdk1.8.0_151
>            Reporter: Caleb Ott
>            Priority: Critical
>         Attachments: Selection_111.png, Selection_117.png
> I have an application that extracts text from multiple files on a file share. I've been running into issues with the application running out of memory (~26g dedicated to the heap).
> I found in the heap dumps a "fDTDDecl" buffer which creates very large char arrays and never releases that memory. In the attached picture you can see the heap dump with 4 SAXParsers holding onto a large chunk of memory. The fourth one is expanded to show it is all being held by the "fDTDDecl" field. This dump is from a scaled-down execution (not a 26g heap).
> It looks like that DTD field should never be that large; I'm wondering if this is a bug in Xerces instead. I can easily reproduce the issue by attempting to extract text from large .pst files.
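Not part of the original report, but relevant to the fDTDDecl growth described above: Xerces (including the JDK's bundled copy) exposes a feature that rejects DOCTYPE declarations outright, which keeps the parser from buffering a DTD at all. A sketch of that feature in isolation, independent of how Tika wires up its parsers internally (and note that disallowing DOCTYPEs may break inputs that legitimately need one):

```java
import javax.xml.parsers.SAXParserFactory;

public class NoDoctype {
    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // With this Xerces feature enabled, any document containing a
        // DOCTYPE declaration is rejected with a fatal error, so the
        // parser never allocates the DTD declaration buffer.
        factory.setFeature(
            "http://apache.org/xml/features/disallow-doctype-decl", true);
        System.out.println(
            factory.getFeature("http://apache.org/xml/features/disallow-doctype-decl"));
        // prints "true"
    }
}
```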

This message was sent by Atlassian JIRA