Supported media types per parser

Jukka Zitting
Hi,

The correct parser for a given document is selected based on the
detected media type of that document. The mapping between media types
and parser classes is kept in entries like the following in the Tika
configuration file.

    <parser name="parse-txt" class="org.apache.tika.parser.txt.TXTParser">
        <mime>text/plain</mime>
    </parser>

The list of <mime/> types per parser is pretty much fixed by the
implementation, and I don't see much value in being able to
customize that mapping in the configuration. The downside of having
these mappings in the configuration file is that alternative
configurations (e.g. ones with extra parsers) need to be
updated whenever the list of media types per parser changes. We've
already seen a number of such mapping changes for example with the
added OOXML support and with the extended image and audio parsers.

To simplify the configuration I'd like to move the media type mappings
from the configuration file to the parser classes. The mapping
information could for example be returned from a parser instance
through a new Parser method like this:

    /**
     * Returns the media types supported by this parser.
     *
     * @return supported media types
     */
    Set<MediaType> getSupportedTypes();

Composite parsers like AutoDetectParser would use this information
instead of explicitly given mappings to dispatch documents to the
correct parsers.
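
To make the dispatch idea concrete, here is a minimal, self-contained
sketch of how a composite parser might build its lookup table from the
types each parser reports. Everything in it is hypothetical and only for
illustration: SketchParser, PlainTextParser, ImageSketchParser and
CompositeSketchParser are not Tika classes, media types are plain
strings standing in for MediaType, and parse() just returns a tagged
string instead of doing real parsing.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DispatchSketch {

    /** Hypothetical stand-in for a parser that reports its own media types. */
    interface SketchParser {
        /** Returns the media types supported by this parser (strings stand in for MediaType). */
        Set<String> getSupportedTypes();

        /** Placeholder for real parsing; returns a tagged copy of the input. */
        String parse(String content);
    }

    /** Illustrative parser that supports a single media type. */
    static final class PlainTextParser implements SketchParser {
        public Set<String> getSupportedTypes() {
            return Collections.singleton("text/plain");
        }

        public String parse(String content) {
            return "txt:" + content;
        }
    }

    /** Illustrative parser that supports several media types. */
    static final class ImageSketchParser implements SketchParser {
        public Set<String> getSupportedTypes() {
            return new HashSet<>(Arrays.asList("image/png", "image/gif"));
        }

        public String parse(String content) {
            return "img:" + content;
        }
    }

    /** Builds the type-to-parser table from the parsers themselves, not from configuration. */
    static final class CompositeSketchParser {
        private final Map<String, SketchParser> byType = new HashMap<>();

        CompositeSketchParser(List<SketchParser> parsers) {
            for (SketchParser parser : parsers) {
                for (String type : parser.getSupportedTypes()) {
                    // Assumption: a parser listed later wins when two claim the same type.
                    byType.put(type, parser);
                }
            }
        }

        /** Dispatches to the parser registered for the detected media type. */
        String parse(String detectedType, String content) {
            SketchParser parser = byType.get(detectedType);
            if (parser == null) {
                throw new IllegalArgumentException("No parser for " + detectedType);
            }
            return parser.parse(content);
        }
    }
}
```

With this shape, adding an extra parser to a configuration would only
mean listing the parser itself; the set of types it handles travels with
the class. How conflicts between parsers claiming the same type should
be resolved is an open choice, and the "last one wins" rule above is
just one option.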

WDYT? Does anyone depend on the ability to explicitly customize the
per-parser media type mappings?

BR,

Jukka Zitting