error parsing Microsoft documents


Edward Quick

Hi,

My logs report this error several times:

Error parsing: http://planetba.baplc.com/general/aptrix/aptcsops.nsf/AttachmentsByTitle/ewr/$FILE/ewr.doc: failed(2,0): Can't be handled as Microsoft document. java.io.IOException: Invalid header signature; read 7015536635646467195, expected -2226271756974174256
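The two numbers in the message look like the first eight bytes of the file, read as a little-endian long and compared against the signature the parser expects. Below is a minimal Java sketch for decoding them back into bytes. The class and method names are hypothetical and are not part of Nutch or POI; only the two long values come from the log line above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for inspecting the values in the error message.
public class SignatureDecoder {

    // Convert a long back into the 8 bytes it was read from,
    // assuming the header was read in little-endian order.
    static byte[] toLittleEndianBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(value)
                .array();
    }

    // Render bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    // Interpret bytes as ASCII text (only meaningful for printable bytes).
    static String toAscii(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        long read = 7015536635646467195L;      // "read" value from the log
        long expected = -2226271756974174256L; // "expected" value from the log

        System.out.println("expected: " + toHex(toLittleEndianBytes(expected)));
        System.out.println("read:     " + toHex(toLittleEndianBytes(read)));
        System.out.println("read as ASCII: " + toAscii(toLittleEndianBytes(read)));
    }
}
```

Decoded this way, the expected value is the byte sequence D0 CF 11 E0 A1 B1 1A E1, and the value read is 7B 5C 72 74 66 31 5C 61, which is the ASCII text `{\rtf1\a`. That suggests this particular file may be RTF saved with a .doc extension, although that is an inference from the first eight bytes only.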

I searched the mailing list and found the following post:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200610.mbox/%3C6911914.post@...%3E

which states:

The reason for failure means that you can't parse these files using the
lib-parsems plugins, because they use a "fast save" format, which is not
supported.

Your only option is to use some other external parser through parse-ext
plugin.



Does that mean that if I remove parse-msword from nutch-site.xml and replace it with parse-ext, it should work? Or (as I suspect) is it a bit more complicated than that?

Thanks for your help.

Ed.
