PDFBox log file locks Fetcher


Sebastian Nagel | exorbyte
Hi,

I have run into a hanging Fetcher.
The symptoms are similar to those described in NUTCH-719 and NUTCH-721:
a long wait, then "Aborting with xxx hung threads."

I tracked it down and found that the hung FetcherThreads are all processing PDF documents.
Then the penny dropped, and I connected this with some strange files I had observed in $PWD after
failed crawls:  PDFBox.log  PDFBox.log.1  PDFBox.log.1.lck

I found that it is a known issue that PDFBox writes a file PDFBox.log to the current directory:
http://www.mail-archive.com/pdfbox-users@.../msg00344.html
(the code is still unchanged!)

Indeed, it seems that two or more FetcherThreads block each other when they happen to parse
PDF documents at the same time.
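
A small sketch of how the extra log files can arise. It assumes only the documented behaviour of java.util.logging.FileHandler: when the requested file is already in use, the handler appends a unique number to the name and creates a matching .lck file. The class name LogFileCollision and the temporary directory are made up for illustration; this does not reproduce the hang itself, only the file names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.FileHandler;

/** Hypothetical demo class, not part of PDFBox or Nutch. */
public class LogFileCollision {

    /**
     * Opens two FileHandlers on the same file name, as two threads would do if
     * both passed an unsynchronized "logger == null" check, and reports whether
     * the second handler was diverted to PDFBox.log.1.
     */
    public static boolean secondHandlerGetsOwnFile() {
        try {
            Path dir = Files.createTempDirectory("pdfbox-log-demo");
            String pattern = dir.resolve("PDFBox.log").toString();
            FileHandler first = new FileHandler(pattern, true);
            FileHandler second = new FileHandler(pattern, true);
            try {
                // While both handlers are open, each one holds its own .lck file.
                return Files.exists(dir.resolve("PDFBox.log.1"));
            } finally {
                first.close();
                second.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second handler diverted to PDFBox.log.1: "
                + secondHandlerGetsOwnFile());
    }
}
```

The .lck files are removed when a handler is closed, so a leftover PDFBox.log.1.lck would be consistent with a handler that was never closed.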

When I remove parse-pdf from "plugin.includes", the problem disappears.

Does anyone have a quick fix for this problem? Of course, I also want to get content from PDF files.

Thanks,

Sebastian

Re: PDFBox log file locks Fetcher

Otis Gospodnetic-2-2
I don't have a fix, but I have a suggestion - have you tried using the very latest version of PDFBox?  I believe it's going through Apache Incubator... aha, here: http://incubator.apache.org/pdfbox/

Too bad the page doesn't say *when* the release was made, so one can get a sense of the state of the project.

 Otis





Re: PDFBox log file locks Fetcher

Sebastian Nagel | exorbyte
Dear Otis,

I checked out the latest version and built "trunk" successfully; the jar file is named pdfbox*0.8.jar.
I still have to adapt the build.xml of Nutch's parse-pdf plugin (hopefully nothing more).

But the code writing to $PWD/PDFBox.log is still in PDFBox. I think it must be removed or changed;
otherwise you will run into serious problems when using PDFBox in a concurrent environment such as
Nutch. So someone should get in touch with the PDFBox developers. Shall I?

Another (additional) suggestion would be to run the parser in its own thread, controlled by the
FetcherThread, which stops the parser thread when a timeout is reached. Parser plugins always
depend on a lot of external libraries such as PDFBox, so a similar situation may well come up
again.
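
To make the timeout idea concrete, here is a minimal sketch using java.util.concurrent. It is not Nutch code: the class TimeLimitedParse, the method parseWithTimeout and the fallback value are names invented for this example, and the Callable stands in for whatever the parser plugin would do.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper, not part of Nutch: runs a parse task with a time limit. */
public class TimeLimitedParse {

    // Daemon threads, so a parser that is stuck forever cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "parser");
        t.setDaemon(true);
        return t;
    });

    /**
     * Runs the task in a separate thread and waits at most timeoutMillis.
     * Returns the fallback value if the task times out or fails.
     */
    public static <T> T parseWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupts the parser thread. A parser blocked in non-interruptible
            // code keeps running, but the calling thread is free to continue.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) {
        String fast = parseWithTimeout(() -> "parsed text", 1000, "timed out");
        String slow = parseWithTimeout(() -> {
            Thread.sleep(5000); // stands in for a hanging parser
            return "parsed text";
        }, 200, "timed out");
        System.out.println(fast);
        System.out.println(slow);
    }
}
```

Note that this only frees the calling thread; a parser thread that ignores interruption would still occupy a thread in the pool until it returns.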

Sebastian



Re: PDFBox log file locks Fetcher

Sebastian Nagel | exorbyte
Just to illustrate how serious it was for my crawl:

The crawl was started with 4000 seed URLs, all from different hosts,
and the following options and properties:
 threads = 10
 depth = 7
 generate.max.per.host = 7
 topN = 28000 (4000*7)

I grepped the log file for the number of documents fetched per cycle
and the number of threads that hung:

cycle  fetched  hung_threads
1         3863             0
2        16624             1
3          943            10
4          296            10
5          134            10
6           61            10
7           50            10

The number of documents fetched per cycle converges to zero because of the blocked FetcherThreads.
With every new cycle, unfetched PDF documents take up more and more of the queue.


I now have a solution. I had to patch Nutch's parse-pdf plugin as well as PDFBox.
The first trial with the current version of PDFBox didn't change anything.
The code opening the log file PDFBox.log has not changed.

For PDFBox (trunk from svn) apply the patch:

% svn diff src/main/java/org/apache/pdfbox/exceptions/LoggingObject.java
Index: src/main/java/org/apache/pdfbox/exceptions/LoggingObject.java
===================================================================
--- src/main/java/org/apache/pdfbox/exceptions/LoggingObject.java       (revision 801087)
+++ src/main/java/org/apache/pdfbox/exceptions/LoggingObject.java       (working copy)
@@ -35,10 +35,10 @@

        //http://www.rgagnon.com/javadetails/java-0501.html
        if (logger_ == null){
-               FileHandler fh = new FileHandler("PDFBox.log", true);
-               fh.setFormatter(new SimpleFormatter());
+               // FileHandler fh = new FileHandler("PDFBox.log", true);
+               // fh.setFormatter(new SimpleFormatter());
                logger_ = Logger.getLogger("TestLog");
-               logger_.addHandler(fh);
+               // logger_.addHandler(fh);

             /*Set the log level here.
             The lower your logging level, the more stuff will be logged.

(Of course, commenting out is not a real solution.)
Then run ant in trunk/ to build PDFBox and copy
 <PDFBox>/trunk/lib/pdfbox-0.8.0-incubating.jar
 <PDFBox>/trunk/external/fontbox-0.8.0-incubating.jar
 <PDFBox>/trunk/external/jempbox-0.8.0-incubating.jar
to <Nutch>/src/plugin/parse-pdf/lib/
(Still TODO: renew license files)
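
As a rough idea of what a cleaner fix than commenting out the handler might look like, here is a sketch. It is not the actual PDFBox code: the class name QuietLoggingObject and the system property "pdfbox.log.file" are assumptions made up for this example. The logger is created once during class initialization, which Java guarantees to be thread-safe, and a log file is opened only when explicitly requested.

```java
import java.io.IOException;
import java.util.logging.FileHandler;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;

/** Hypothetical replacement sketch; not the actual PDFBox LoggingObject. */
public class QuietLoggingObject {

    // Hypothetical property name: file logging is enabled only when it is set.
    static final String LOG_FILE_PROPERTY = "pdfbox.log.file";

    // Static initialization runs exactly once, so concurrent callers
    // cannot each create their own FileHandler.
    private static final Logger LOGGER = createLogger();

    private static Logger createLogger() {
        Logger logger = Logger.getLogger(QuietLoggingObject.class.getName());
        String file = System.getProperty(LOG_FILE_PROPERTY);
        if (file != null) {
            try {
                FileHandler fh = new FileHandler(file, true);
                fh.setFormatter(new SimpleFormatter());
                logger.addHandler(fh);
            } catch (IOException e) {
                // Keep the default handlers instead of failing the parse.
                logger.warning("Cannot open log file " + file + ": " + e);
            }
        }
        return logger;
    }

    public static Logger getLogger() {
        return LOGGER;
    }
}
```

With the property unset, nothing is written to the current directory; messages go to whatever handlers the surrounding application has configured.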



Now apply the patches to Nutch 1.0 parse-pdf:

src/plugin/parse-pdf/plugin.xml

--- src/plugin/parse-pdf/plugin.xml~ 2009-03-23 20:04:06.000000000 +0100
+++ src/plugin/parse-pdf/plugin.xml  2009-08-05 11:28:06.000000000 +0200
@@ -26,9 +26,9 @@
       <library name="parse-pdf.jar">
          <export name="*"/>
       </library>
-      <library name="PDFBox-0.7.4-dev.jar"/>
-      <library name="FontBox-0.2.0-dev.jar"/>
-      <library name="JempBox-0.2.0-dev.jar"/>
+      <library name="pdfbox-0.8.0-incubating.jar"/>
+      <library name="fontbox-0.8.0-incubating.jar"/>
+      <library name="jempbox-0.8.0-incubating.jar"/>
       <library name="bcprov-jdk14-132.jar"/>
       <!-- Uncomment the following two lines after you have downloaded the
            libraries, see README.txt for more details.-->



src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf/PdfParser.java

--- src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf/PdfParser.java~ 2009-03-23 20:04:01.000000000 +0100
+++ src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf/PdfParser.java  2009-08-05 11:16:35.000000000 +0200
@@ -17,14 +17,14 @@

 package org.apache.nutch.parse.pdf;

-import org.pdfbox.pdfparser.PDFParser;
-import org.pdfbox.pdmodel.PDDocument;
-import org.pdfbox.pdmodel.PDDocumentInformation;
-import org.pdfbox.pdmodel.encryption.BadSecurityHandlerException;
-import org.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
-import org.pdfbox.util.PDFTextStripper;
+import org.apache.pdfbox.pdfparser.PDFParser;
+import org.apache.pdfbox.pdmodel.PDDocument;
+import org.apache.pdfbox.pdmodel.PDDocumentInformation;
+import org.apache.pdfbox.pdmodel.encryption.BadSecurityHandlerException;
+import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
+import org.apache.pdfbox.util.PDFTextStripper;

-import org.pdfbox.exceptions.CryptographyException;
+import org.apache.pdfbox.exceptions.CryptographyException;

 // Commons Logging imports
 import org.apache.commons.logging.Log;


... and build Nutch. I tested

What about further steps? Is there a maintainer for parse-pdf?

Bye, Sebastian