Passing context information to parsers

Jukka Zitting
Hi,

There are a few cases where it would be useful to be able to pass
generic context information instead of just document metadata to a
parser:

* We currently use JavaBean conventions for such cases, as in
DelegatingParser, where the delegate parser is expected to be
configured with setDelegate() before the parse() method is called.
In some environments (OSGi, etc.) it would be more convenient to
pass such context information dynamically.

* In TIKA-125 it might be better to use a generic context mechanism
instead of document metadata to pass the default locale to a parser.

* Decryption keys (like in PDFParser) and other similar information
should preferably not be included in the document metadata, where
it could easily end up being stored in search indexes, etc.

In all of the above cases it would be useful if the passed information
could be generic Java objects instead of just strings. And in none of
these cases should the information be included in the output metadata.

To fulfill these requirements I was thinking of adding one more
argument to the parse method, like this:

    /**
     * Parses a document stream into a sequence of XHTML SAX events.
     * Fills in related document metadata in the given metadata object.
     * <p>
     * The given document stream is consumed but not closed by this method.
     * The responsibility to close the stream remains on the caller.
     * <p>
     * Information about the parsing context can be passed in the context
     * parameter. See the parser implementations for the kinds of context
     * information they expect.
     *
     * @param stream the document stream (input)
     * @param handler handler for the XHTML SAX events (output)
     * @param metadata document metadata (input and output)
     * @param context parsing context
     * @throws IOException if the document stream could not be read
     * @throws SAXException if the SAX events could not be processed
     * @throws TikaException if the document could not be parsed
     */
    void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, Map<String, Object> context)
            throws IOException, SAXException, TikaException;
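
As a rough illustration of how a parser might read typed objects out of
such a context map, here is a minimal sketch. The helper class
ParseContextSketch and the key names "locale" and "password" are purely
hypothetical; the proposal above does not define any keys or helpers.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of Tika: typed lookups on the proposed context map. */
final class ParseContextSketch {

    // Hypothetical key names; the proposal does not define any.
    static final String LOCALE_KEY = "locale";
    static final String PASSWORD_KEY = "password";

    private ParseContextSketch() {
    }

    /** Returns the value stored under key if it has the expected type, otherwise the fallback. */
    static <T> T get(Map<String, Object> context, String key, Class<T> type, T fallback) {
        Object value = (context == null) ? null : context.get(key);
        // isInstance(null) is false, so missing keys also yield the fallback
        return type.isInstance(value) ? type.cast(value) : fallback;
    }

    public static void main(String[] args) {
        Map<String, Object> context = new HashMap<>();
        context.put(LOCALE_KEY, Locale.GERMANY);            // a real object, not a string
        context.put(PASSWORD_KEY, "secret".toCharArray());  // stays out of the metadata

        Locale locale = get(context, LOCALE_KEY, Locale.class, Locale.getDefault());
        char[] password = get(context, PASSWORD_KEY, char[].class, null);
        System.out.println("locale=" + locale + ", password present=" + (password != null));
    }
}
```

Since the context is separate from the Metadata object, nothing placed
in it would show up in the output metadata unless a parser copies it
there explicitly.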

This would be our most notable API change since Tika 0.1 and would
break backwards compatibility unless we keep the current parse()
method available as a deprecated version.
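
One possible way to keep the old method around is sketched below. This
is only an assumption about how it could be done: ContextAwareParser
and CountingParser are hypothetical names, an abstract base class is
used only to keep the example short, and Metadata and TikaException are
empty stand-ins so that the example compiles without Tika on the
classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Stand-ins for the real Tika classes, so the sketch is self-contained.
class Metadata {
}

class TikaException extends Exception {
    TikaException(String message) {
        super(message);
    }
}

/** Hypothetical base class; not an existing Tika type. */
abstract class ContextAwareParser {

    /** The proposed four-argument method. */
    abstract void parse(InputStream stream, ContentHandler handler,
            Metadata metadata, Map<String, Object> context)
            throws IOException, SAXException, TikaException;

    /** @deprecated kept for backwards compatibility; delegates with an empty context. */
    @Deprecated
    void parse(InputStream stream, ContentHandler handler, Metadata metadata)
            throws IOException, SAXException, TikaException {
        parse(stream, handler, metadata, Collections.<String, Object>emptyMap());
    }
}

/** Toy parser that only records how many context entries it received. */
class CountingParser extends ContextAwareParser {
    int lastContextSize = -1;

    @Override
    void parse(InputStream stream, ContentHandler handler,
            Metadata metadata, Map<String, Object> context) {
        lastContextSize = context.size();
    }

    public static void main(String[] args) throws Exception {
        CountingParser parser = new CountingParser();
        // Existing callers keep using the three-argument form unchanged.
        parser.parse(new ByteArrayInputStream(new byte[0]), new DefaultHandler(), new Metadata());
        System.out.println("context entries seen: " + parser.lastContextSize);
    }
}
```

With something like this, existing client code would keep compiling
(with a deprecation warning), while parser implementations would only
need to implement the new method.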

BR,

Jukka Zitting

Re: Passing context information to parsers

Michael Wechner
Jukka Zitting wrote:
>
> This would be our most notable API change since Tika 0.1 and would
> break backwards compatibility unless we keep the current parse()
> method available as a deprecated version.
>  

+1 on keeping the current parse() as a deprecated version

(-1 on breaking backwards compatibility)

Cheers

Michael
> BR,
>
> Jukka Zitting
>