I keep getting the following errors when parsing PDFs:
Error parsing: http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/DeT+three+wishes/$FILE/Three+wishes.pdf: failed(2,0): Can't be handled as pdf document. java.lang.ClassCastException: org.pdfbox.pdmodel.encryption.PDEncryptionDictionary
fetch of http://planetba.baplc.com/general/aptrix/aptrix.nsf/AttachmentsByTitle/Uniform+Wearers+Guide/$FILE/BAUWS.pdf failed with: java.lang.NoClassDefFoundError: javax/media/jai/PlanarImage
I have applied the patch from https://issues.apache.org/jira/browse/NUTCH-643, but it did not stop the ClassCastExceptions for all of the documents.
Currently there are about 243 PDFs on our intranet that I can't get Nutch to parse :-(
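
For the NoClassDefFoundError, one quick way to confirm whether the JAI classes are visible at all is a small standalone check. This is only a sketch: the class name javax.media.jai.PlanarImage is taken from the error above, while ClasspathCheck and isClassAvailable are made-up names for illustration, and the result only means something if it is run with the same classpath that Nutch uses.

```java
// Minimal sketch: reports whether a class can be loaded from the current classpath.
// ClasspathCheck and isClassAvailable are hypothetical names, not part of Nutch or PDFBox.
class ClasspathCheck {

    // Returns true if the named class can be located, without running its static initializers.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is missing, or one of its own dependencies could not be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name copied from the NoClassDefFoundError in the fetch log.
        String jaiClass = "javax.media.jai.PlanarImage";
        System.out.println(jaiClass + " available: " + isClassAvailable(jaiClass));
    }
}
```

If this reports false, the JAI jar is presumably not on the classpath of the JVM doing the parsing, which would match the second error. It does not say anything about the ClassCastException, which looks like a separate problem.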