Edward Quick


My logs report this error several times:

Error parsing:$FILE/ewr.doc: failed(2,0): Can't be handled as Microsoft document. Invalid header signature; read 7015536635646467195, expected -2226271756974174256
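
To see what the parser actually found at the start of the file, the two numbers in the message can be turned back into bytes. This is only a rough sketch, and it assumes the parser reads the first 8 bytes of the file as a little-endian signed 64-bit integer (my assumption, not something the log states). The helper name signature_bytes is made up for illustration.

```python
import struct


def signature_bytes(value: int) -> bytes:
    """Pack a signed 64-bit integer back into the 8 bytes it was read from.

    Assumes little-endian byte order (an assumption about the parser).
    """
    return struct.pack("<q", value)


# The two values copied from the error message above.
read_sig = signature_bytes(7015536635646467195)
expected_sig = signature_bytes(-2226271756974174256)

print(read_sig)            # b'{\\rtf1\\a'
print(read_sig.hex())      # 7b5c727466315c61
print(expected_sig.hex())  # d0cf11e0a1b11ae1
```

Under that assumption, the expected value is the usual OLE2 compound-document header (D0 CF 11 E0 A1 B1 1A E1), while the bytes actually read spell out {\rtf1\a. That might mean ewr.doc is really an RTF file saved with a .doc extension, but I have not confirmed this against the file itself.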

I searched the mailing list and found the following post

which states:

The reason for failure means that you can't parse these files using the
lib-parsems plugins, because they use a "fast save" format, which is not

Your only option is to use some other external parser through parse-ext

Does that mean that if I take parse-msword out of nutch-site.xml and replace it with parse-ext, it should work? Or (as I suspect) is it a bit more complicated than that?

Thanks for your help.


