[jira] [Commented] (TIKA-2496) TIKA crashes / runs out of memory on simple PDF


JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250251#comment-16250251 ]

Tim Allison commented on TIKA-2496:
-----------------------------------

I'm afraid we can't do much unless you can share a triggering file.  Have you had a chance to try tika-app-1.16.jar or a nightly build?  For example: https://builds.apache.org/job/Tika-trunk/1387/org.apache.tika$tika-app/artifact/org.apache.tika/tika-app/1.17-20171108.165131-76/tika-app-1.17-20171108.165131-76.jar 
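
One way to retest outside the crawler is to run tika-app in a separate child JVM with its own heap cap and a timeout, so an OutOfMemoryError or a hang cannot take the crawler down with it. The following is only a minimal sketch under stated assumptions: the class name TikaAppRetest, the output file tika-output.txt, the 300-second timeout, and the PDF path passed on the command line are all hypothetical; the jar name is whichever tika-app build is being tested, and --text is tika-app's plain-text output option.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: runs tika-app on one file in a child JVM.
class TikaAppRetest {

    /** Builds the command line for a child JVM that runs tika-app on one file. */
    static List<String> buildCommand(String tikaAppJar, String heap, String pdfPath) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx" + heap);   // heap cap applies to the child JVM only
        cmd.add("-jar");
        cmd.add(tikaAppJar);      // e.g. tika-app-1.16.jar or a nightly build
        cmd.add("--text");        // tika-app option: extract plain text
        cmd.add(pdfPath);
        return cmd;
    }

    /** Runs the command; returns the exit code, or -1 if the timeout was hit. */
    static int run(List<String> cmd, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd)
                .redirectErrorStream(true)                      // merge stderr into stdout
                .redirectOutput(new File("tika-output.txt"))    // assumed output location
                .start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();                                // parse is stuck; kill the child
            return -1;
        }
        return p.exitValue();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: TikaAppRetest <tika-app.jar> <file.pdf>");
            return;
        }
        List<String> cmd = buildCommand(args[0], "5g", args[1]);
        System.out.println("Running: " + String.join(" ", cmd));
        System.out.println("Exit code: " + run(cmd, 300));
    }
}
```

Running the same file against 1.15, 1.16, and a nightly build this way would show whether the failure is specific to one version; any stack trace from the child JVM ends up in the output file.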

> TIKA crashes / runs out of memory on simple PDF
> -----------------------------------------------
>
>                 Key: TIKA-2496
>                 URL: https://issues.apache.org/jira/browse/TIKA-2496
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.15
>         Environment: Linux, Java 8
>            Reporter: chelambarasan
>
> We're using TIKA embedded in a webcrawler and today I've encountered a PDF that results in OutOfMemory errors while being processed by TIKA.
> We tried with -Xmx5g; the PDF files are approximately 50 MB in size.
> Tika version: 1.15
> The error is as follows:
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> at org.apache.pdfbox.io.ScratchFileBuffer.addPage(ScratchFileBuffer.java:132)
> at org.apache.pdfbox.io.ScratchFileBuffer.ensureAvailableBytesInPage(ScratchFileBuffer.java:184)
> at org.apache.pdfbox.io.ScratchFileBuffer.write(ScratchFileBuffer.java:236)
> at org.apache.pdfbox.io.RandomAccessOutputStream.write(RandomAccessOutputStream.java:46)
> at org.apache.pdfbox.cos.COSStream$2.write(COSStream.java:266)
> at org.apache.pdfbox.pdfparser.COSParser.readValidStream(COSParser.java:1142)
> at org.apache.pdfbox.pdfparser.COSParser.parseCOSStream(COSParser.java:970)
> at org.apache.pdfbox.pdfparser.COSParser.parseFileObject(COSParser.java:781)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:742)
> at org.apache.pdfbox.pdfparser.COSParser.parseObjectDynamically(COSParser.java:673)
> at org.apache.pdfbox.pdfparser.COSParser.parseDictObjects(COSParser.java:633)
> at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:241)
> at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:276)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1132)
> at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1066)
> at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:141)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> Please let us know how to fix this issue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)