I have seen this project, too. The problem with it is that it only provides
mappings for the object definitions as customized DOM objects, and that does
not really help you when importing the text.
TIKA's big advantage is the possibility to use SAX events when importing XML
formats. I am currently working on a patch for the ODF importer that maps
content.xml's tags to XHTML tags. This can be done very simply by a new SAX
I am preparing to post two patches to TIKA's issue management system:
a) an import of ODF documents as structured XHTML, as mentioned before;
b) a better conversion of XHTML SAX streams to plain text (better than
only reading characters() events). The problem here is the difference
between HTML block and span elements: just reading the element contents
creates whitespace issues...
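
To illustrate the whitespace issue in b), here is a minimal sketch and not
the actual patch. The class name BlockAwareTextHandler, the helper toText()
and the (incomplete) list of block elements are my own assumptions for
illustration. It only uses the plain JDK SAX API: text from characters() is
collected as before, but a line break is added at the boundaries of block
elements, while span-like elements add nothing.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BlockAwareTextHandler extends DefaultHandler {

    // Assumption: a small, incomplete set of XHTML block-level elements.
    private static final Set<String> BLOCK_ELEMENTS = Set.of(
            "p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
            "ul", "ol", "li", "table", "tr", "br");

    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        breakLineIfBlock(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        breakLineIfBlock(qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Inline (span-like) content is appended without extra whitespace.
        out.append(ch, start, length);
    }

    // Adds one line break at a block boundary, avoiding duplicate breaks.
    private void breakLineIfBlock(String qName) {
        if (BLOCK_ELEMENTS.contains(qName)
                && out.length() > 0
                && out.charAt(out.length() - 1) != '\n') {
            out.append('\n');
        }
    }

    public String text() {
        return out.toString().trim();
    }

    // Hypothetical convenience helper: parses well-formed XHTML from a string.
    public static String toText(String xhtml) {
        try {
            BlockAwareTextHandler handler = new BlockAwareTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.text();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With only characters() events, "<p>one <b>two</b></p><p>three</p>" would
come out as "one twothree". With the handler above, the two paragraphs end
up on separate lines, and the <b> element adds no whitespace.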
The same technique could be used for Open XML (Office 2007) documents. Using
the new classes of POI is a pain (the same problem: thousands of new objects
from a really big JAR file that contains only DOM mappings, not SAX mappings,
for Open XML objects). A clean SAX solution would be preferable.
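
As a rough sketch of what I mean by a SAX solution (not the patch itself):
a handler that renames source elements to XHTML elements and forwards the
events to another ContentHandler. The class name TagMappingHandler, the
render() helper and the mapping table are assumptions for illustration. The
table uses a few ODF element names; a real mapping would need more entries
and would have to look at attributes (for example the heading level). The
same structure should work for Open XML with a different table.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class TagMappingHandler extends DefaultHandler {

    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Assumption: a tiny, illustrative mapping from ODF qNames to XHTML names.
    private static final Map<String, String> MAPPING = Map.of(
            "text:p", "p",
            "text:h", "h1",
            "text:span", "span",
            "text:list", "ul",
            "text:list-item", "li");

    private final ContentHandler target;

    public TagMappingHandler(ContentHandler target) {
        this.target = target;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        String mapped = MAPPING.get(qName);
        if (mapped != null) {
            // Attributes are dropped in this sketch.
            target.startElement(XHTML_NS, mapped, mapped, new AttributesImpl());
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        String mapped = MAPPING.get(qName);
        if (mapped != null) {
            target.endElement(XHTML_NS, mapped, mapped);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // Text is passed through, also for elements without a mapping.
        target.characters(ch, start, length);
    }

    // Hypothetical helper for trying it out: writes the mapped events as
    // markup into a string (no escaping, for demonstration only).
    public static String render(String xml) {
        StringBuilder sb = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                sb.append('<').append(q).append('>');
            }
            @Override
            public void endElement(String u, String l, String q) {
                sb.append("</").append(q).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(false); // qName keeps the "text:" prefix
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new TagMappingHandler(recorder));
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

No DOM tree and no per-element objects are created here; each event is
translated and handed on immediately, which is the point of using SAX.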
Just give me two more days to finish my patches!