PDF Parse Error

PDF Parse Error

Richard Braman
I get the following errors when parsing PDF files:

060228 160518 fetch okay, but can't parse
http://taxpros.marylandtaxes.com/publications/revenews/archives/spr05_hi.pdf,
reason: failed(2,202): Content truncated at 66005 bytes. Parser
can't handle incomplete pdf file.

060228 160354 fetch okay, but can't parse
http://www.mstc.state.ms.us/info/stats/transfer/tran0704.pdf, reason:
failed(2,0): Can't be handled as pdf document.
java.lang.NullPointerException

060228 160518 fetch okay, but can't parse
http://www.dor.state.nc.us/downloads/corp_archive/03archive/NC478_Instructions.pdf,
reason: failed(2,0): Can't be handled as pdf document.
java.io.IOException: You do not have permission to extract text

I have a number of errors like this in my log, mostly the "content
truncated" one.

The thing is, these files all open fine in Acrobat.
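
For anyone triaging a similar log, the three failure types listed above can be counted with a small script. The sketch below is a hypothetical helper, not part of Nutch. The class name, the category labels, and the assumption that each log entry has been joined onto a single line are mine, and the pattern is based only on the entries quoted above. The example.org URLs are placeholders.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not Nutch code): groups parse failures by cause.
final class ParseFailureTriage {
    // Assumes one log entry per line, shaped like the entries in the post.
    private static final Pattern ENTRY = Pattern.compile(
        "fetch okay, but can't parse\\s+(\\S+?),\\s*reason:\\s*"
        + "failed\\((\\d+),(\\d+)\\):\\s*(.*)");

    // Returns a category label chosen for this sketch.
    static String classify(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.find()) return "not-a-parse-failure";
        String detail = m.group(4);
        if (detail.contains("Content truncated")) return "truncated";
        if (detail.contains("permission to extract text")) return "extraction-forbidden";
        if (detail.contains("NullPointerException")) return "parser-exception";
        return "other";
    }

    // Counts parse failures per category, skipping unrelated log lines.
    static Map<String, Integer> countByCause(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            String cause = classify(line);
            if (!cause.equals("not-a-parse-failure")) {
                counts.merge(cause, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "060228 160518 fetch okay, but can't parse http://example.org/a.pdf, "
                + "reason: failed(2,202): Content truncated at 66005 bytes. "
                + "Parser can't handle incomplete pdf file.",
            "060228 160354 fetch okay, but can't parse http://example.org/b.pdf, "
                + "reason: failed(2,0): Can't be handled as pdf document. "
                + "java.lang.NullPointerException");
        System.out.println(countByCause(sample));
    }
}
```

Separating the causes matters here because each one needs a different remedy: a configuration change, a parser fix, or a request to the document's author.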
 
 

Richard Braman
mailto:[hidden email]
561.748.4002 (voice)

http://www.taxcodesoftware.org <http://www.taxcodesoftware.org/>
Free Open Source Tax Software

 
Re: PDF Parse Error

Jeff Ritchie
In nutch-site.xml, set http.content.limit to something larger, for
example:

<property>
<name>http.content.limit</name>
<value>655360</value>
</property>

Jeff.
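
To confirm that a fetched file was cut off rather than corrupt, one rough check is whether the bytes start with the %PDF- header and still carry the %%EOF marker near the end. The following is a minimal sketch, not Nutch or PDFBox code. The class name and the 1024-byte tail window are assumptions chosen for illustration, and the check is only a heuristic: it says nothing about whether the PDF is otherwise valid.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical heuristic (not Nutch or PDFBox code): does this byte array
// look like a PDF that was downloaded in full?
final class PdfCompletenessCheck {
    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
    // Assumed search window for the end-of-file marker.
    private static final int TAIL_WINDOW = 1024;

    static boolean looksComplete(byte[] content) {
        if (content == null || !startsWith(content, HEADER)) {
            return false;
        }
        // Only the tail is searched: a truncated download loses the marker there.
        int from = Math.max(0, content.length - TAIL_WINDOW);
        return indexOf(content, EOF_MARKER, from) >= 0;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Returns the first index >= from where needle occurs in data, or -1.
    private static int indexOf(byte[] data, byte[] needle, int from) {
        for (int i = from; i + needle.length <= data.length; i++) {
            int j = 0;
            while (j < needle.length && data[i + j] == needle[j]) j++;
            if (j == needle.length) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] cutOff = "%PDF-1.4\n1 0 obj\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksComplete(cutOff));
    }
}
```

A file that fails this check after a fetch but opens fine in a desktop viewer is a strong hint that the crawler's content limit, not the document, is the problem.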



RE: PDF Parse Error

Richard Braman
I set it to 0; there are some big PDFs on the sites I am crawling.
Thanks, Jeff.


Re: PDF Parse Error

Andrzej Białecki
In reply to this post by Richard Braman
Richard Braman wrote:
> I get the following errors regarding pdf:
>  

Please do not cross-post to multiple groups. Your questions are suitable
for nutch-user alone.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: PDF Parse Error

Richard Braman
I had all of the content size limits set to 0, which I thought meant
no truncation, yet I was still getting a lot of truncation in PDF
files. I reread my config and noticed an easily missed detail: for
file and ftp, 0 means no truncation, but for http the value needs to
be -1.


Here is the help I got in nutch-user from Jérôme, who I noticed is a
developer.

> Edit your nutch-site.xml (or nutch-default.xml) and change the
> http.content.limit (set it to 0 if you don't want any content
> truncation at all).
>
> Jérôme



This is very inconsistent, and unless there is a reason for it, I think
it should be changed for the next version. Otherwise it becomes a support
problem. It is so easy to miss that one of you developers missed it.

This sounds like a bug, something suitable for nutch-dev I think.


Site config that works for no truncation:
<property>
  <name>file.content.limit</name>
  <value>0</value>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>0</value>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>



Re: PDF Parse Error

Andrzej Białecki
Richard Braman wrote:
> This sounds like a bug, something suitable for nutch-dev I think.
>
>  

Yes, but please do not cross-post - many of us are subscribed to both
groups, and we're getting multiple copies of your posts...

I agree, this is inconsistent and should be changed. I think all places
should use -1 as a "magic" value, because it's obviously invalid.
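
The convention proposed above could be sketched as follows. This is a hypothetical helper written for illustration, not Nutch's implementation: the class name, the UNLIMITED constant, and the choice to treat any negative value as "no limit" (so that 0 literally means "keep zero bytes") are assumptions.

```java
import java.util.Arrays;

// Hypothetical helper (not Nutch code): one truncation rule shared by all
// protocols, with a negative limit as the single "no truncation" value.
final class ContentLimit {
    static final int UNLIMITED = -1;

    // Returns the content unchanged when the limit is negative or not
    // exceeded; otherwise returns a copy cut down to 'limit' bytes.
    static byte[] apply(byte[] content, int limit) {
        if (limit < 0 || content.length <= limit) {
            return content;
        }
        return Arrays.copyOf(content, limit);
    }

    public static void main(String[] args) {
        byte[] fetched = new byte[100_000];
        System.out.println(apply(fetched, UNLIMITED).length);
        System.out.println(apply(fetched, 65_536).length);
    }
}
```

Because a negative length can never be a real limit, using it as the sentinel avoids the ambiguity of 0, which could reasonably be read either as "no limit" or as "keep nothing".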

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: PDF Parse Error

Jérôme Charron
> Yes, but please do not cross-post - many of us are subscribed to both
> groups, and we're getting multiple copies of your posts...

+1

> I agree, this is inconsistent and should be changed. I think all places
> should use -1 as a "magic" value, because it's obviously invalid.

+1
Richard, could you please create a Jira issue about this?
Thanks

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: [PDFBox-user] PDF Parse Error

Ben Litchfield
In reply to this post by Richard Braman

I believe these errors are due to a parsing bug in PDFBox that has been
fixed since the 0.7.2 release.  Please give the nightly build (it should be
a drop-in replacement) a try from http://www.pdfbox.org/dist and let me know
if you are still having issues.

Ben



RE: [PDFBox-user] PDF Parse Error

Richard Braman
Hi Ben,

We actually got to the bottom of all of them except for one. The content
truncation was due to an inconsistency in the Nutch config.
The "no permission to extract text" error is actually accurate: the
author, the NC Department of Revenue, put this restriction on all of their
files (I have asked them to remove it, as it hampers public accessibility).
The NullPointerException is the only one left to deal with that may be due
to the parsing bug. Is that the one you are referring to?


Re: [PDFBox-user] PDF Parse Error

Ben Litchfield
Yes, the NPE should be fixed.

 Ben


RE: PDF Parse Error

Richard Braman
In reply to this post by Jérôme Charron
https://issues.apache.org/jira/browse/NUTCH-219
