Content-type detection for Tika

Jukka Zitting
Hi,

I'm thinking about implementing the (draft) shared MIME database spec
[1] from freedesktop.org in Tika as a modern MIME magic implementation
to help automatically detect and handle the types of resources where
insufficient typing metadata is available. The specified typing
information also includes an inheritance model which allows for
automatic failover to more generic parsers (e.g. from image/svg to
text/xml) when specific parser plugins are not available.
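
To make the fallback idea concrete, here is a rough sketch of how a parser lookup could walk up the type hierarchy. The class and method names are hypothetical, the parser is represented by a plain string, and the sketch assumes a single parent per type (the real database may well declare more than one). The type names are the ones from the example above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: parser lookup with fallback to more generic types. */
public class TypeFallbackSketch {
    // child type -> parent type (assumption: one parent per type)
    private final Map<String, String> parentOf = new HashMap<>();
    // type -> name of a parser registered for exactly that type
    private final Map<String, String> parserFor = new HashMap<>();

    public void addSubclass(String type, String parent) {
        parentOf.put(type, parent);
    }

    public void registerParser(String type, String parserName) {
        parserFor.put(type, parserName);
    }

    /** Returns the parser for the type itself or for its nearest ancestor. */
    public Optional<String> findParser(String type) {
        String current = type;
        // The step limit guards against accidental cycles in the hierarchy.
        for (int steps = 0; current != null && steps < 32; steps++) {
            String parser = parserFor.get(current);
            if (parser != null) {
                return Optional.of(parser);
            }
            current = parentOf.get(current);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        TypeFallbackSketch registry = new TypeFallbackSketch();
        registry.addSubclass("image/svg", "text/xml");
        registry.registerParser("text/xml", "GenericXmlParser");
        // No SVG parser is registered, so the lookup falls back to the XML parser.
        System.out.println(registry.findParser("image/svg").orElse("none"));
    }
}
```

With no SVG-specific parser registered, the lookup for image/svg ends up at whatever is registered for text/xml.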

I know that the Java Activation Framework has some of this
functionality and that there are a few MIME magic libraries available
for Java, but my understanding is that all of these are either not
very accurate or unusable in Apache projects due to GPL licensing. I
would also like to add an extension point where available parser
plugins could register more accurate custom type detection
components.
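
As a rough illustration of such an extension point, the sketch below lets plugins register detectors that are consulted in registration order before falling back to a generic type. Everything here (the TypeDetector interface, the registry class, the pdfDetector helper) is hypothetical and not an existing Tika or Nutch API; the only outside fact used is that PDF files start with the bytes "%PDF-".

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of a registry for pluggable type detectors. */
public class DetectorRegistrySketch {

    /** Hypothetical extension point: inspects the first bytes of a resource. */
    public interface TypeDetector {
        Optional<String> detect(byte[] prefix);
    }

    private final List<TypeDetector> detectors = new ArrayList<>();

    /** Plugins would call this to contribute their own detection logic. */
    public void register(TypeDetector detector) {
        detectors.add(detector);
    }

    /** Asks each detector in registration order; the first answer wins. */
    public String detect(byte[] prefix) {
        for (TypeDetector detector : detectors) {
            Optional<String> type = detector.detect(prefix);
            if (type.isPresent()) {
                return type.get();
            }
        }
        return "application/octet-stream"; // generic fallback type
    }

    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    /** Example detector a PDF parser plugin might register. */
    public static TypeDetector pdfDetector() {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        return prefix -> startsWith(prefix, magic)
                ? Optional.of("application/pdf")
                : Optional.empty();
    }

    public static void main(String[] args) {
        DetectorRegistrySketch registry = new DetectorRegistrySketch();
        registry.register(pdfDetector());
        byte[] sample = "%PDF-1.4 ...".getBytes(StandardCharsets.US_ASCII);
        System.out.println(registry.detect(sample));
    }
}
```

A real design would probably also need some notion of priority between detectors; first-registered-wins is only the simplest choice.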

Is such functionality already included or planned in Nutch? Any
thoughts, comments or pointers to help me get started?

One drawback of the freedesktop.org spec is that their standard MIME
type database is GPL-licensed, so I can't include it directly in the
project. However, all the major Linux distributions seem to be adopting
the standard, so the database should be available at least on those
platforms without manual installation.
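
Finding an installed copy of the database could look roughly like the sketch below. It assumes, as I understand the XDG conventions, that the database lives in a "mime" subdirectory of each entry in XDG_DATA_DIRS, and that /usr/local/share/:/usr/share/ is the default when the variable is unset. The class and method names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: where an installed shared MIME database might be. */
public class MimeDatabaseLocator {
    // Assumed default when XDG_DATA_DIRS is not set.
    static final String DEFAULT_DATA_DIRS = "/usr/local/share/:/usr/share/";

    /** Builds the list of candidate "mime" directories from a colon-separated value. */
    public static List<Path> candidateDirs(String xdgDataDirs) {
        String value = (xdgDataDirs == null || xdgDataDirs.isEmpty())
                ? DEFAULT_DATA_DIRS
                : xdgDataDirs;
        List<Path> result = new ArrayList<>();
        for (String base : value.split(":")) {
            if (!base.isEmpty()) {
                result.add(Paths.get(base, "mime"));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Report which candidate directories actually exist on this machine.
        for (Path dir : candidateDirs(System.getenv("XDG_DATA_DIRS"))) {
            String state = Files.isDirectory(dir) ? "found" : "missing";
            System.out.println(dir + " (" + state + ")");
        }
    }
}
```

On platforms where none of the candidates exist, detection would have to degrade gracefully or rely on a separately installed database.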

[1] http://freedesktop.org/wiki/Standards_2fshared_2dmime_2dinfo_2dspec

BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/
Software craftsmanship, JCR consulting, and Java development
Re: Content-type detection for Tika

Jérôme Charron
>
>
> I'm thinking about implementing the (draft) shared MIME database spec
> [1] from freedesktop.org in Tika as a modern MIME magic implementation
> to help automatically detect and handle the types of resources where
> insufficient typing metadata is available. The specified typing
> information also includes an inheritance model which allows for
> automatic failover to more generic parsers (e.g. from image/svg to
> text/xml) when specific parser plugins are not available.

I already have such code for Nutch (freedesktop-based content-type
detection).
I no longer have time to spend on Nutch these days, but I can send you
the code.
Please contact me at my private email address.

Regards

Jérôme