Parsing PDF Nutch Achilles heel?

Parsing PDF Nutch Achilles heel?

hk-
I have been testing different Nutch configurations to see
what slows down the fetching process on my servers (Nutch 0.7.1).
In my experience, PDF parsing is Nutch's Achilles' heel.

Nutch works fine on older computers, but with the combination of
|parse-(text|html|pdf)
and http.content.limit = -1 (needed to get PDF parsing to work), Nutch
sometimes freezes completely.

Are any improvements to PDF parsing planned for the next
version of Nutch (0.8)?
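
The parse-(text|html|pdf) fragment above is part of a regular expression (the plugin.includes property) that selects which plugins Nutch loads. As a rough illustration of what that fragment selects, here is a minimal Java sketch. The class and method names are hypothetical and not part of Nutch, and it assumes the expression has to match the whole plugin id.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: shows which plugin ids a
// plugin.includes-style regular expression would select.
final class PluginIncludesDemo {

    // Assumption: the expression must match the entire plugin id.
    static boolean isIncluded(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String includes = "parse-(text|html|pdf)";
        // Example ids; parse-msword is only here as a non-matching case.
        for (String id : Arrays.asList("parse-text", "parse-html", "parse-pdf", "parse-msword")) {
            System.out.println(id + " -> " + isIncluded(includes, id));
        }
    }
}
```

Under that assumption, dropping pdf from the alternation would keep the PDF parser from loading at all, which is one way to confirm that PDF parsing is what triggers the freeze.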

RE: Parsing PDF Nutch Achilles heel?

Steve Betts
There is a bug in the PDF parser tool used with 0.7. You can get a newer
version to replace the jars with the parse-pdf plugin and the freeze will go
away.

Thanks,

Steve Betts
[hidden email]
937-477-1797

Re: Parsing PDF Nutch Achilles heel?

hk-
Where do I get the new version: from http://www.pdfbox.org/ or from
http://svn.apache.org/viewcvs.cgi/lucene/nutch/?




RE: Parsing PDF Nutch Achilles heel?

Steve Betts
I should have included the link, but I used PDFBox.

Thanks,

Steve Betts
[hidden email]
937-477-1797



Re: Parsing PDF Nutch Achilles heel?

hk-
PDFBox-0.7.2, or one of the nightly builds (PDFBox-0.7.3-dev...)?


RE: Parsing PDF Nutch Achilles heel?

Steve Betts
I am using PDFBox-0.7.2-log4j.jar. That doesn't make it run a lot faster,
but it does allow it to complete.

Thanks,

Steve Betts
[hidden email]
937-477-1797



Re: Parsing PDF Nutch Achilles heel?

hk-
I tried it with PDFBox-0.7.3-dev-20060125-log4j (renamed to
PDFBox-0.7.2-log4j). It worked on about 50% of the PDFs; the rest
"failed with: java.lang.NoClassDefFoundError:
org/fontbox/afm/AFMParser". In the end, Nutch froze again.
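
A NoClassDefFoundError like the one above means the JVM could not load the named class at run time. The org/fontbox package name suggests the class lives in a separate FontBox jar that the nightly build expects next to the PDFBox jar, though the thread does not confirm this. The following is a minimal Java sketch of a classpath check. The ClasspathCheck class is hypothetical and not part of Nutch or PDFBox, and it only tests the classpath of the JVM it runs in, so it has to be started with the same jars as the parse-pdf plugin to be meaningful.

```java
// Hypothetical diagnostic, not part of Nutch or PDFBox: reports whether
// a class can be found on the classpath of the current JVM.
final class ClasspathCheck {

    static boolean isLoadable(String className) {
        try {
            // 'false' = locate the class without running its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the error message quoted in the post.
        String name = "org.fontbox.afm.AFMParser";
        System.out.println(name + (isLoadable(name) ? " is loadable" : " is MISSING"));
    }
}
```

If the class is reported as missing, the jar that contains it is not visible to the JVM, which would be consistent with only some PDFs failing: presumably only documents that exercise that font code reach the missing class.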




Re: Parsing PDF Nutch Achilles heel?

Doug Cutting-2
In reply to this post by Steve Betts
Steve Betts wrote:
> I am using PDFBox-0.7.2-log4j.jar. That doesn't make it run a lot faster,
> but it does allow it to complete.

I find xpdf much faster than PDFBox.

http://www.mail-archive.com/nutch-dev@.../msg00161.html

Does this work any better for you?

Doug
Re: Parsing PDF Nutch Achilles heel?

hk-
Could you create a new version from the latest xpdf version?
I know that older versions of pdftotext (before October 2005) had some issues with PDF 1.6 (Acrobat 7).




Re: Parsing PDF Nutch Achilles heel?

hk-
"Could you create a new version from the latest xpdf version?
I know that older versions of pdftotext (before October 2005) had
some issues with PDF 1.6 (Acrobat 7)."
Sorry, my mistake!

I have now tested pdftotext, and it is faster than PDFBox, but it doesn't
prevent the Nutch freezes.
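
Since a faster parser alone does not stop the freezes, one general way to keep a single stuck document from blocking everything is to run each parse under a time limit. The following is a minimal Java sketch (Java 8 or later) of that idea. It is not how Nutch 0.7 works and not a fix confirmed in this thread; TimedParse, parseWithTimeout, and the fallback value are hypothetical names, and the Callable stands in for whatever actually parses the PDF.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper, not part of Nutch: gives up on a parse that
// does not finish within a time limit instead of waiting forever.
final class TimedParse {

    static String parseWithTimeout(Callable<String> parseTask, long timeoutMillis, String fallback) {
        // Daemon thread, so a parser that ignores interruption cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-parse");
            t.setDaemon(true);
            return t;
        });
        Future<String> result = executor.submit(parseTask);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the stuck parse
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the parser itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String fast = parseWithTimeout(() -> "extracted text", 1000, "");
        String slow = parseWithTimeout(() -> {
            Thread.sleep(5000); // stands in for a parse that hangs
            return "never returned";
        }, 100, "");
        System.out.println("fast: '" + fast + "', slow: '" + slow + "'");
    }
}
```

A caveat on this design: the caller gets control back after the limit, but a parser stuck in a loop that never checks for interruption keeps using CPU in the background until the JVM exits.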


