[jira] [Commented] (TIKA-3097) Out of memory while parsing docx


Clark Perkins (Jira)

    [ https://issues.apache.org/jira/browse/TIKA-3097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17101584#comment-17101584 ]

Tim Allison commented on TIKA-3097:

Uncompressed, you're looking at ~150MB for the file. XMLBeans adds quite a bit of overhead on top of that, but even so, 2GB sounds excessive. There is a streaming option for docx and pptx: https://cwiki.apache.org/confluence/display/TIKA/MSOfficeParsers
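
For anyone who wants to check the uncompressed figure against their own file: a .docx is a zip archive, so summing the uncompressed bytes of its entries gives that number. The following is a minimal sketch using only java.util.zip, not anything taken from Tika; the class name DocxSize and the helper sampleZip are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DocxSize {
    /** Sums the uncompressed bytes of every entry in a zip stream (a .docx is a zip). */
    static long uncompressedSize(InputStream in) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(in)) {
            while (zip.getNextEntry() != null) {
                int n;
                // read() returns -1 at the end of the current entry
                while ((n = zip.read(buf)) != -1) {
                    total += n;
                }
            }
        }
        return total;
    }

    /** Hypothetical helper: builds an in-memory zip with one highly compressible entry. */
    static byte[] sampleZip(int entryBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            byte[] chunk = new byte[entryBytes];
            java.util.Arrays.fill(chunk, (byte) 'a');
            zip.write(chunk);
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            // Measure a real file, e.g. the attachment from the issue
            try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                System.out.println("uncompressed bytes: " + uncompressedSize(in));
            }
        } else {
            // No file given: demonstrate on a synthetic archive
            byte[] zipped = sampleZip(1_000_000);
            long size = uncompressedSize(new ByteArrayInputStream(zipped));
            System.out.println("compressed bytes: " + zipped.length);
            System.out.println("uncompressed bytes: " + size);
        }
    }
}
```

Run as `java DocxSize test.docx` to measure a file on disk. The result is only the raw size of the XML and media parts; the in-memory footprint of a DOM-style parse will be larger than this.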

I'll take a look in the debugger later today and let you know if this is a bug or feature.

> Out of memory while parsing docx
> --------------------------------
>                 Key: TIKA-3097
>                 URL: https://issues.apache.org/jira/browse/TIKA-3097
>             Project: Tika
>          Issue Type: Bug
>          Components: core, parser
>    Affects Versions: 1.24
>            Reporter: suchendra
>            Priority: Major
>         Attachments: test.docx
> I have written simple Scala code to extract the content from an uploaded docx file. The JVM goes OOM when Tika tries to parse the file. I configured the JVM heap to 1GB and then tried 2GB, and the same issue occurs, both with the jar and in my code.
> Attached the file for reference.

This message was sent by Atlassian Jira