[jira] [Commented] (TIKA-2643) Tika call hangs when processes a pdf on Cloudera Hadoop

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/TIKA-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16477486#comment-16477486 ]

Tim Allison commented on TIKA-2643:

bq. The tricky part is I cannot attach a debugger against this call within MapReduce job over the cluster.

Ugh. Right. Of course. Is there anything more you can do with logging? I didn't read through your logs closely enough, but can you confirm that the hang happens during parseToString() and not immediately after it returns?
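
When a debugger cannot be attached, one way to get that confirmation from logs alone is to run the call under a deadline and dump every thread's stack if the deadline passes. The following is only a sketch using the standard library: `HangProbe`, `callWithTimeout` and `dumpAllStacks` are hypothetical names, not part of Tika, and the output goes to stderr on the assumption that the task's stderr ends up in the job logs.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper (not a Tika class): runs a call under a deadline and logs where it is stuck. */
class HangProbe {

    /** Formats the stack of every live thread, a rough stand-in for a debugger or jstack. */
    static String dumpAllStacks() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    /** Runs the task on a worker thread; on timeout, logs all stacks and rethrows TimeoutException. */
    static <T> T callWithTimeout(String label, Callable<T> task, long timeout, TimeUnit unit)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hang-probe-" + label);
            t.setDaemon(true); // a stuck worker should not keep the JVM alive
            return t;
        });
        System.err.println("START " + label);
        Future<T> future = pool.submit(task);
        try {
            T result = future.get(timeout, unit);
            System.err.println("END " + label); // seeing this line means the call itself returned
            return result;
        } catch (TimeoutException e) {
            // Dump before cancelling so the worker's stack still shows where it is blocked.
            System.err.println("TIMEOUT " + label + "\n" + dumpAllStacks());
            future.cancel(true);
            throw e;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Wrapping the existing call as `HangProbe.callWithTimeout("parseToString", () -> tika.parseToString(stream), 5, TimeUnit.MINUTES)`, where `tika` and `stream` are whatever the job already uses, should show either an END line (the hang is after the call) or a TIMEOUT line with the stack of the `hang-probe-parseToString` thread. Interrupting the worker does not stop code that ignores interrupts, so this locates a hang rather than curing it.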

Without understanding your full framework, I can't say with any accuracy what might be causing this. :)

Some things that have caused permanent hangs for me in the past:
1) not draining stderr/stdout from a child process (the child blocks once the pipe buffer fills)
2) infinite loops in parsers
3) blocking IO that, well, blocks
4) calling take() instead of poll() on an ExecutorCompletionService; take() blocks until some task completes
5) more generally, calling the blocking methods on concurrent collections (ArrayBlockingQueue, etc.) instead of the non-blocking or timed alternatives
6) not quite a permanent hang, but crazy churn caused by multithreaded garbage collection
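
To illustrate items 4 and 5, here is a minimal sketch of collecting results with a timed poll() instead of take(). `BoundedCollector` and `collect` are hypothetical names, and the single overall deadline is just one possible policy, not something taken from the reporter's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: gathers task results but never waits past an overall deadline. */
class BoundedCollector {

    /** Returns the results of the tasks that finished within timeoutMillis, in completion order. */
    static <T> List<T> collect(List<Callable<T>> tasks, long timeoutMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        ExecutorCompletionService<T> ecs = new ExecutorCompletionService<>(pool);
        for (Callable<T> task : tasks) {
            ecs.submit(task);
        }
        List<T> results = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            for (int i = 0; i < tasks.size(); i++) {
                long remaining = Math.max(0, deadline - System.nanoTime());
                // take() would wait indefinitely here; the timed poll() returns null instead.
                Future<T> done = ecs.poll(remaining, TimeUnit.NANOSECONDS);
                if (done == null) {
                    System.err.println("deadline passed with " + (tasks.size() - i) + " task(s) unfinished");
                    break;
                }
                try {
                    results.add(done.get()); // the future is already complete, so get() does not block
                } catch (ExecutionException e) {
                    System.err.println("task failed: " + e.getCause());
                }
            }
        } finally {
            pool.shutdownNow(); // interrupts stragglers; tasks that ignore interrupts keep running
        }
        return results;
    }
}
```

The same idea applies to queues: on an ArrayBlockingQueue, `poll(timeout, unit)` and `offer(e, timeout, unit)` give up after the wait expires, whereas `take()` and `put(e)` wait without a limit.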

I don't think this is the fault of the parser (2 above). We can see from the logs that the parser is making at least some progress into the file.

Do any of the above look like candidates for you?

> Tika call hangs when processes a pdf on Cloudera Hadoop
> -------------------------------------------------------
>                 Key: TIKA-2643
>                 URL: https://issues.apache.org/jira/browse/TIKA-2643
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>         Environment: Cloudera Hadoop 5.8
>            Reporter: feng ye
>            Priority: Blocker
>         Attachments: hang-stdout.txt, hang.zip, testJournalParser.pdf
> Tika.parseToString(InputStream) hangs when called within a MapReduce job to process a PDF file on Cloudera Hadoop 5.8 (also observed on 5.4). It can process some other PDF files on the same cluster. I am attaching the file, the syslog, and the stdout logs. Interestingly, the same file is processed fine on a Hortonworks cluster.
> This issue blocks us from making our Tika-based feature available on Cloudera clusters, a major flavor of Hadoop, so your timely attention would be very much appreciated.
