Extending existing Parsers - Not easy to do right now, could we make it easier?


Extending existing Parsers - Not easy to do right now, could we make it easier?

Stephane Bastian-3
Hi All,

I finally found some time to send an email and share some thoughts on
one of the stickiest issues I've had so far with Tika: it's almost
impossible to leverage and override the functionality of existing
Parsers. I believe the main reason is that the parse method leaves no
room to override existing behavior or provide my own logic. It's
pretty much an all-or-nothing kind of thing.

For instance, take the Html Parser and let's say I just need to extract
some metadata not currently handled by Tika. If I'm not mistaken, I
basically have two options:

1) Modify the current Html Parser, add code to extract the new metadata,
and submit a patch to Tika
2) Create my own class:
    - copy/paste the existing code (the current parse() method leaves
very little room to override existing behavior or provide my own logic)
    - add my code
    - register my class so that it's called for the given mimeType

In all the cases I've had so far, I simply needed to register my own
ContentHandler on the source document (and not on the structured
content). Unfortunately, that's currently not possible.
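As a concrete example, the kind of handler I'd like to register looks roughly like this. The sketch is self-contained (JDK SAX classes only; the class name and the sample page are made up for illustration) and simply collects <meta> name/content pairs from the raw event stream:

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Collects <meta name="..." content="..."> pairs from the raw SAX stream.
public class MetaCollector extends DefaultHandler {

    public final Map<String, String> metadata = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("meta".equalsIgnoreCase(qName)) {
            String name = atts.getValue("name");
            String content = atts.getValue("content");
            if (name != null && content != null) {
                metadata.put(name, content);
            }
        }
    }

    // Parses a well-formed (X)HTML string and returns the collected metadata.
    public static Map<String, String> collect(String xhtml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        MetaCollector collector = new MetaCollector();
        reader.setContentHandler(collector);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return collector.metadata;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head>"
                + "<meta name=\"author\" content=\"jane\"/>"
                + "<meta name=\"generator\" content=\"acme\"/>"
                + "</head><body/></html>";
        System.out.println(collect(page));
    }
}
```

The only missing piece is a hook to register such a handler on the raw stream inside an existing Tika parser.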

So, I wanted to know 1) whether other people have had trouble extending
existing Parsers, and 2) whether this is an issue we should tackle?

BR,

Stephane Bastian


Re: Extending existing Parsers - Not easy to do right now, could we make it easier?

Jukka Zitting
Hi,

On Tue, Dec 9, 2008 at 8:27 AM, Stephane Bastian
<[hidden email]> wrote:
> So, I wanted to know 1) whether other people have had trouble extending
> existing Parsers, and 2) whether this is an issue we should tackle?

We're of course open to contributions on issues like this, but I'm
wondering if your use case would be better served by directly using
the underlying parser library. If not, how about an extension point
like the one defined in the patch below?

BR,

Jukka Zitting

Index: src/main/java/org/apache/tika/parser/html/HtmlParser.java
===================================================================
--- src/main/java/org/apache/tika/parser/html/HtmlParser.java (revision 724309)
+++ src/main/java/org/apache/tika/parser/html/HtmlParser.java (working copy)
@@ -84,6 +84,31 @@

     }

+    /**
+     * Extra handler that can be specified by the client application for
+     * additional processing of raw HTML SAX events generated by NekoHTML.
+     */
+    private ContentHandler extension;
+
+    /**
+     * Returns the configured extension handler.
+     *
+     * @return configured extension handler, or <code>null</code>
+     */
+    public ContentHandler getExtension() {
+        return extension;
+    }
+
+    /**
+     * Sets an extension handler for additional processing of the raw HTML
+     * SAX events generated by the underlying HTML parser.
+     *
+     * @param extension extension handler
+     */
+    public void setExtension(ContentHandler extension) {
+        this.extension = extension;
+    }
+
     public void parse(
             InputStream stream, ContentHandler handler, Metadata metadata)
             throws IOException, SAXException, TikaException {
@@ -102,9 +127,17 @@
                 new MatchingContentHandler(getTitleHandler(metadata), title),
                 new MatchingContentHandler(getMetaHandler(metadata), meta));

+        // Simplify the HTML for Tika clients
+        handler = new XHTMLDowngradeHandler(handler);
+
+        // Add the configured extension, if any
+        if (extension != null) {
+            handler = new TeeContentHandler(handler, extension);
+        }
+
         // Parse the HTML document
         SAXParser parser = new SAXParser();
-        parser.setContentHandler(new XHTMLDowngradeHandler(handler));
+        parser.setContentHandler(handler);
         parser.parse(new InputSource(Utils.getUTF8Reader(stream, metadata)));
     }

Re: Extending existing Parsers - Not easy to do right now, could we make it easier?

Stephane Bastian-3
Hi Jukka,

This fix would definitely help me in the short run, since I've got to
extend the Html parser for my specific needs. However, I'm thinking
that I may run into the same problem with another parser in a month or two.
Therefore I'm leaning toward finding a solution that would work for all
Parsers.

Let me throw an idea here:

Parsing goes through several fairly well-defined steps, and in the case
of Tika it could be represented as follows:
1) Generate SAX events from the stream
2) Extract metadata and save it in an instance of the Metadata class
3) Generate SAX events describing the structure of the document

For html pages:
    1) is done for us by CyberNeko, which converts an HTML stream
(which most of the time is *not* well formed) into SAX events
    2) is basically the body of the parse method
    3) is kind of mixed into the body of the parse method

Right now, Tika lets us interact with 3) and 2) only at the cost of an
almost complete rewrite of the parent parser.

How about slightly modifying Tika to let us hook custom code into 1) as
well? We could do this by adding an extra ContentHandler to the parse method:

public void parse(InputStream stream, ContentHandler rawHandler,
ContentHandler structuredHandler, Metadata metadata);

Of course, this means modifying the signature of the parse method a bit,
and that is not something we want to do if we don't have to.
However, I feel the benefits outweigh the cost of the extra parameter:
it would give people a way to add extra functionality to existing
Parsers very quickly.


As you pointed out, I could also work directly with the underlying parser
myself, but in that case I would lose many of the benefits of using Tika:
1) Streaming
2) The ability to leverage the MatchingContentHandler, which also works
in streaming mode. BTW, to me this part alone would probably deserve a
project of its own
3) Being shielded from the details of parsing a document and converting
it to SAX events (trivial for Html but very handy for other documents
such as MS Office...)

BR,

Stephane Bastian


Re: Extending existing Parsers - Not easy to do right now, could we make it easier?

Jukka Zitting
Hi,

On Tue, Dec 9, 2008 at 12:19 PM, Stephane Bastian
<[hidden email]> wrote:
> Parsing goes through several fairly well-defined steps, and in the case of
> Tika it could be represented as follows:
> 1) Generate SAX events from the stream
> 2) Extract metadata and save it in an instance of the Metadata class
> 3) Generate SAX events describing the structure of the document

For many document types steps 1 and 2 are reversed, and 1 and 3 are
actually just a single step. I'm not sure if there's much room for
generalization here.

> How about slightly modifying Tika to let us hook custom code into 1) as
> well? We could do this by adding an extra ContentHandler to the parse method:
>
> public void parse(InputStream stream, ContentHandler rawHandler,
> ContentHandler structuredHandler, Metadata metadata);

Most document types simply don't have a "raw" SAX stream, so I don't
think this is a good idea in the general case. The only SAX events you
have are the ones sent to the content handler we have now, so what
you're trying to do could just as well be achieved using a
TeeContentHandler on top of the existing Parser interface.

What I believe you are looking for is a mechanism that would map the
low-level details of all sorts of document types to XML. That might
be interesting, but I'm not sure Tika is the best place to do it.
It might be a better idea to approach the parser libraries directly
about a potential SAX mapping, as they are in a much better position
to evaluate what such a mapping should look like and whether
implementing it is reasonable.

> 2) The ability to leverage the MatchingContentHandler, which also works in
> streaming mode. BTW, to me this part alone would probably deserve a project
> of its own

Thanks, I did think it was a good idea, but it's good to hear that
others like it too. :-)

BR,

Jukka Zitting

Re: Extending existing Parsers - Not easy to do right now, could we make it easier?

Stephane Bastian-3
Hi,

You're definitely right that there would be a mapping between a given
document and XML, via a ContentHandler, which is kind of what Tika does
already. This also means that metadata would be extracted from the "raw"
ContentHandler.
In any case, as you pointed out, Tika might not be the best place to do
this.
However, going back to my initial short-term issue, which is extending
the Html Parser, I would definitely take the solution you proposed
earlier if it's still on the table ;)

BR,

Stephane Bastian


Re: Extending existing Parsers - Not easy to do right now, could we make it easier?

Jukka Zitting
Hi,

On Tue, Dec 9, 2008 at 1:04 PM, Stephane Bastian
<[hidden email]> wrote:
> In any case, as you pointed out, Tika might not be the best place to do this.
> However, going back to my initial short-term issue, which is extending the
> Html Parser, I would definitely take the solution you proposed earlier if
> it's still on the table ;)

I thought about this a bit more (see TIKA-182), and I must say that
I'd rather not apply the patch to Tika. Doing so would create an extra
binding between client code and the underlying parser library, and
would make it difficult for us to replace the parser later if we
wanted to.

BR,

Jukka Zitting

RE: Extending existing Parsers - Not easy to do right now, could we make it easier?

Uwe Schindler
In my opinion, if somebody wants such a specialized parser with his own
optimizations, he could simply write his own parser using NekoHTML and
plug it into Tika.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [hidden email]
