[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477480#comment-16477480 ]

feng ye commented on TIKA-2643:
-------------------------------

For information, the whole suite of 257 files, including this file, was processed within 30 seconds on a Hortonworks cluster, while on Cloudera processing this file alone hangs for over 10 minutes until the MR job times out.

> Tika call hangs when processes a pdf on Cloudera Hadoop
> -------------------------------------------------------
>
>                 Key: TIKA-2643
>                 URL: https://issues.apache.org/jira/browse/TIKA-2643
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>         Environment: Cloudera Hadoop 5.8
>            Reporter: feng ye
>            Priority: Blocker
>         Attachments: hang-stdout.txt, hang.zip, testJournalParser.pdf
>
>
> Tika.parseToString(InputStream) hangs when called within a MapReduce job to process a PDF file on Cloudera Hadoop 5.8 (also observed on 5.4). It can process some other PDF files on the same cluster. I am attaching the file along with the syslog and stdout logs. Interestingly, the same file is processed without problems on a Hortonworks cluster.
> This issue blocks us from making our Tika-based feature available on Cloudera clusters, a major flavor of Hadoop, so your timely attention would be very much appreciated.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)