Problems parsing PDFs

Edward Quick




Hi,

I keep getting the following errors when parsing PDFs:

Error parsing: http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/DeT+three+wishes/$FILE/Three+wishes.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary

fetch of http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Uniform+Wearers+Guide/$FILE/BAUWS.pdf failed with: java.lang.NoClassDefFoundError: javax/media/jai/PlanarImage
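The second error is a NoClassDefFoundError, which means the named class could not be loaded at runtime. As a rough diagnostic, the sketch below checks whether the two classes named in these errors can be found by a given classloader. The class name ClasspathCheck and the helper isClassAvailable are hypothetical names made up for illustration, and the check only reflects the classpath the snippet itself is run with. Nutch plugins may be loaded through their own classloaders, so a result here is a hint rather than proof.

```java
public class ClasspathCheck {

    // Hypothetical helper: returns true if the named class can be located
    // by this class's classloader, without initializing it.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // ClassNotFoundException: the class is not on the classpath.
            // LinkageError (includes NoClassDefFoundError): the class was found
            // but something it depends on could not be loaded.
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the two error messages above.
        String[] names = {
            "javax.media.jai.PlanarImage",
            "org.pdfbox.pdmodel.encryption.PDEncryptionDictionary"
        };
        for (String name : names) {
            System.out.println(name + " -> "
                    + (isClassAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

To get a meaningful answer it would need to be run with the same jars that the parse plugin sees, for example by passing them with -cp.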

I have applied the patch mentioned here:
https://issues.apache.org/jira/browse/NUTCH-643
but it didn't stop the ClassCastExceptions for all of the files.

Currently I've got about 243 PDFs on our intranet which I can't get Nutch to parse :-(

Cheers,

Ed.

