Microsoft Office versions supported by Tika 1.3?

saisantoshi
I am looking at the formats supported by the newer version of Tika (1.3) and am not sure which versions of Microsoft Office (97/2000/2010/2013) it supports for each of the formats below.

http://tika.apache.org/1.3/formats.html#Microsoft_Office_document_formats


Microsoft Word
Microsoft Excel
Microsoft PowerPoint

I would appreciate it if you could point me to a link that lists all the supported versions for the above.

Thanks,
Sai.
Re: Microsoft Office versions supported by Tika 1.3?

Nick Burch-2
On Tue, 5 Feb 2013, saisantoshi wrote:
> I am looking at the versions supported by newer version of Tika (1.3)
> and was not sure what version(s) of the Microsoft office it supports
> (97/2000/2010/2013) for each of the below?
>
> http://tika.apache.org/1.3/formats.html#Microsoft_Office_document_formats

That section you reference should tell you all you need to know!

As stated there, the OLE2 formats (.doc, .xls and .ppt) from Office 97
onwards are supported (but not older ones such as Office 95), and the
newer OOXML-based formats (.docx, .xlsx, .pptx) introduced with Office
2007 (and used by later versions) are also supported.
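
If you want to check which of those two container families a given file
belongs to before handing it to Tika, a rough sketch along these lines may
help. It is not Tika's own detection code: the class and method names are
made up for illustration, and it only looks at the leading bytes (the OLE2
compound document signature and the ZIP local file header that OOXML
packages start with). It cannot tell which Office version wrote the file,
and a ZIP match only means the file might be OOXML.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper, not part of Tika: guesses the container family from the leading bytes. */
final class OfficeContainerSniffer {

    enum Container { OLE2, ZIP_POSSIBLY_OOXML, UNKNOWN }

    // Signature found at the start of an OLE2 compound document (.doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // "PK\003\004": ZIP local file header. OOXML files are ZIP packages,
    // but so are many other formats, hence "possibly".
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    /** Classifies a buffer holding at least the first bytes of a file. */
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP_POSSIBLY_OOXML;
        }
        return Container.UNKNOWN;
    }

    /** Reads only the first 8 bytes of the file and classifies them (needs Java 11+). */
    static Container sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(OLE2_MAGIC.length));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

Calling OfficeContainerSniffer.sniff(path) on a .doc saved by Office 97 or
later should give OLE2, and on a .docx it should give ZIP_POSSIBLY_OOXML.
For anything beyond a quick sanity check, Tika's own type detection is the
better tool.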

The parsers pull out all the common text, along with a fair amount of
formatting. It's possible that you may find a kind of text that they don't
currently extract (maybe because it's in some obscure new area of the file
used by the most recent Office version, or maybe just in something old but
uncommonly used). In that case you'd need to open a new issue in JIRA,
upload a small file that shows the problem, and ideally also upload a
small failing unit test.

Nick