[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477425#comment-16477425 ]

Tim Allison commented on TIKA-2643:
-----------------------------------

Tika does (rarely) hang on some files, and your processing pipeline needs to be able to handle both out-of-memory errors (OOM) and permanent hangs.
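
One way a pipeline might guard against a hanging parse is to put a time limit around the call. The following is only a minimal sketch under stated assumptions: the class TimeLimitedParse and the method runWithTimeout are hypothetical names and are not part of Tika, and the task passed in stands for whatever parse call the job makes. Note that a Java thread cannot be killed forcibly, so a truly stuck parse keeps its worker thread and memory until the JVM exits. For permanent hangs and OOM, running the parser in a separate process (for example via Tika's ForkParser or tika-server) is likely the more robust option.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper for illustration; not part of Tika. */
final class TimeLimitedParse {

    /**
     * Runs the task on a daemon worker thread and waits at most timeoutMillis.
     * Returns the task's result, or null if it timed out or threw.
     */
    static String runWithTimeout(Callable<String> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "parse-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<String> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: this interrupts the worker, it cannot kill it.
            future.cancel(true);
            return null;
        } catch (ExecutionException e) {
            // The task itself threw; the caller can log and skip the file.
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With Tika on the classpath, the task could be something like () -> new Tika().parseToString(stream), and a null result would tell the job to record the file as failed and move on. The timeout value is an assumption to be tuned per workload.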

I'm not able to reproduce a hang on this file with tika-app 1.17 or 1.18, so I don't think there's much I can do to help.

I see similar logs to what you have:
{noformat}
INFO  OpenType Layout tables used in font Times New Roman,Bold are not implemented in PDFBox and will be ignored
INFO  OpenType Layout tables used in font Arial are not implemented in PDFBox and will be ignored
INFO  OpenType Layout tables used in font Times New Roman,Italic are not implemented in PDFBox and will be ignored
WARN  Format 14 cmap table is not supported and will be ignored
INFO  OpenType Layout tables used in font ABCDEE+Cambria Math are not implemented in PDFBox and will be ignored
INFO  OpenType Layout tables used in font ABCDEE+Arial Unicode MS are not implemented in PDFBox and will be ignored
INFO  OpenType Layout tables used in font Times New Roman are not implemented in PDFBox and will be ignored
INFO  OpenType Layout tables used in font ABCDEE+SimSun are not implemented in PDFBox and will be ignored
WARN  Format 14 cmap table is not supported and will be ignored
INFO  OpenType Layout tables used in font Times New Roman,BoldItalic are not implemented in PDFBox and will be ignored
{noformat}

Any idea what's happening?

> Tika call hangs when processes a pdf on Cloudera Hadoop
> -------------------------------------------------------
>
>                 Key: TIKA-2643
>                 URL: https://issues.apache.org/jira/browse/TIKA-2643
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>         Environment: Cloudera Hadoop 5.8
>            Reporter: feng ye
>            Priority: Blocker
>         Attachments: hang-stdout.txt, hang.zip, testJournalParser.pdf
>
>
> Tika.parseToString(InputStream) hangs when called within a MapReduce job to process a pdf file on Cloudera Hadoop 5.8 (observed on 5.4 too). It can process some other pdf files on the same cluster. I am attaching the file and the syslog as well as stdout logs. Interestingly, the same file is processed without problems on a Hortonworks cluster.
> This issue blocks us from making our Tika-based feature available on Cloudera clusters, a major flavor of Hadoop, so your timely attention would be very much appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)