New UIMA annotator based on Tika

New UIMA annotator based on Tika

Julien Nioche-4
Hi,

Just to let you know that we've donated a UIMA component based on Tika.
It converts markup into UIMA annotations and extracts the text, metadata, etc.
More details on https://issues.apache.org/jira/browse/UIMA-1095

Best,
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com

Re: New UIMA annotator based on Tika

Jukka Zitting
Hi,

On Mon, Sep 22, 2008 at 10:58 AM, Julien Nioche
<[hidden email]> wrote:
> Just to let you know that we've donated a UIMA component based on Tika.
> It converts markup into UIMA annotations and extracts the text, metadata, etc.

Cool, thanks for sharing!

> More details on https://issues.apache.org/jira/browse/UIMA-1095

I took a quick look at the code and noticed that you apparently need
to sanitize some of the parsed text output from Word documents
(cleaning out control characters, normalizing spaces). I guess that's
something we could and should already do in Tika itself.
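
To make the idea concrete, here is a minimal sketch of that kind of
clean-up. It is only an illustration: the class and method names
(TextSanitizer, sanitize) are hypothetical and are not taken from Tika
or from the donated annotator, and the exact rules (which characters to
drop, whether to trim) are assumptions that would need checking against
real Word output.

```java
/**
 * Hypothetical helper, not part of Tika or UIMA: drops control characters
 * and collapses runs of whitespace into a single space.
 */
public final class TextSanitizer {

    private TextSanitizer() {
        // static utility, no instances
    }

    public static String sanitize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean pendingSpace = false;

        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);

            if (Character.isWhitespace(cp) || Character.isSpaceChar(cp)) {
                // Tabs, line breaks, non-breaking spaces etc. all count as
                // one separator; remember it and emit it lazily.
                pendingSpace = true;
            } else if (Character.isISOControl(cp)) {
                // Remaining control characters (e.g. U+0007) are dropped.
            } else {
                // Emit a single space between tokens, never at the start.
                if (pendingSpace && out.length() > 0) {
                    out.append(' ');
                }
                pendingSpace = false;
                out.appendCodePoint(cp);
            }
        }
        // A pending space at the end is never written, so the result
        // has no leading or trailing whitespace.
        return out.toString();
    }
}
```

Whitespace is tested before control characters, so a tab or line break
becomes a space instead of being deleted and gluing two words together.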

BR,

Jukka Zitting