Questions about java TIKA project.


A Z-2

//-----------------------------------------------------------------------------------
I notice that the Apache Tika project provides Java support for a range of
file formats, including the various Office file formats.

I also notice that you are building on POI (presumably 3.9).

-POI has shortfalls around HWPFDocument objects (Microsoft Word
 .doc files). One cannot easily insert

org.apache.poi.hwpf.usermodel.Picture

objects into a document and save it successfully.

The

setFtcAscii(int ftcAscii)

setFtcFE(int ftcFE)

functions don't make it easy to alter font information in an HWPFDocument.
Their names do not make their purpose clear, and the int values that stand
for fonts are by no means evident, as they aren't included as fields in a
companion class.


-Is your project about addressing these sorts of shortfalls inside POI?
//-----------------------------------------------------------------------------------

-Similarly, I want more support for dealing with *.rtf files, particularly
 for inserting text and images rather than simply appending them. I would
also like to be able to read images out of *.rtf files. Are these going to
be dealt with?
//-----------------------------------------------------------------------------------
     

Re: Questions about java TIKA project.

Nick Burch-2
On Thu, 7 Mar 2013, A Z wrote:
> I also notice that you are building on POI (presumably 3.9).
>
> -POI has shortfalls around HWPFDocument objects; Microsoft Word
>  .doc files. One may not really easily insert
>
> org.apache.poi.hwpf.usermodel.Picture

Apache Tika only reads files in through the various libraries it uses, so
write/change support in libraries like Apache POI doesn't affect Tika.

If these limitations in POI do affect you, then the best bet is to ask for
advice from the Apache POI community, and work up patches to add in the
missing features!


> -Similarly, I want more support for dealing with *.rtf files. Particularly
>  to insert text and images, and not simply append them.

Again, Tika is only interested in reading data out of RTF files, not
making changes to them, so that sort of thing is out of scope.

Nick
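
To illustrate the read-only kind of extraction described in the reply, here is a minimal sketch that pulls plain text out of RTF content. It deliberately uses only the JDK's javax.swing.text.rtf.RTFEditorKit, not Tika or POI, so it should not be taken as how Tika itself parses RTF. The class name RtfTextReader, the method extractText, and the sample RTF string are hypothetical and exist only for this example. It reads text only; it does not read images and does not modify the document.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper: reads text out of RTF bytes without changing them.
public class RtfTextReader {

    // Parses the RTF bytes into an in-memory styled document and returns its text.
    static String extractText(byte[] rtfBytes) {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        try (InputStream in = new ByteArrayInputStream(rtfBytes)) {
            kit.read(in, doc, 0); // insert the parsed content at offset 0
            return doc.getText(0, doc.getLength());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException("Unexpected document offset", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample input: a one-paragraph RTF document with a font table.
        String rtf = "{\\rtf1\\ansi{\\fonttbl{\\f0 Arial;}}\\f0 Hello from RTF\\par}";
        String text = extractText(rtf.getBytes(StandardCharsets.US_ASCII));
        System.out.println(text.trim());
    }
}
```

Inserting text or images into an existing .rtf or .doc file is a separate write-side problem, which is why the reply points to the Apache POI community for it.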