On Mon, Sep 22, 2008 at 10:58 AM, Julien Nioche
<[hidden email]> wrote:
> Just to let you know that we've just donated a UIMA component based on Tika
> which is used to convert markup into UIMA annotations, extract the text and
> metadata etc...
I took a quick look at the code and noticed that you apparently need
to sanitize (strip control characters, normalize spaces) some of
the text parsed from Word documents. I guess that's something
that we could and should already be doing in Tika itself.
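
To make the idea concrete, here is a rough sketch of the kind of cleanup
I mean. It is only an illustration, not code from Tika or from the UIMA
component: the class name TextSanitizer and the method sanitize are
made up, and the choices of which characters to drop and what counts as
a space are assumptions.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; not part of Tika or the donated UIMA component. */
public final class TextSanitizer {

    // ASCII control characters, except tab, line feed and carriage return.
    private static final Pattern CONTROL =
            Pattern.compile("[\\p{Cntrl}&&[^\\t\\n\\r]]");

    // Runs of spaces, tabs and non-breaking spaces (assumed definition of "space").
    private static final Pattern SPACES =
            Pattern.compile("[ \\t\\u00A0]+");

    private TextSanitizer() {
    }

    public static String sanitize(String text) {
        // Replace each control character with a space so that words
        // separated only by such a character do not get merged.
        String noControl = CONTROL.matcher(text).replaceAll(" ");
        // Collapse every run of spaces into a single space, then trim the ends.
        return SPACES.matcher(noControl).replaceAll(" ").trim();
    }
}
```

Line breaks are left alone here on the assumption that paragraph
structure should survive; whether that is right for Word output would
need checking against real documents.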