Parser roadmap

Jukka Zitting
Hi,

As you've seen, I've been refactoring the Parser classes quite heavily
for the past few weeks, and now with TIKA-43 I'm reaching a milestone
that already resembles the proposed interface design.

Once TIKA-43 is committed (I'm giving it a day or two for reviews and
comments) there are still two Parser related changes that I'd like to
do before I think we're ready to do the first 0.1 release.

First, I'd like to replace the current Iterable<Content> construct
with a Metadata object that allows metadata to be passed in and out of
the parser. Also, this Metadata object should be decoupled from parser
configuration.

Second, instead of returning the text content of a document as a
String, I'd like the parsers to generate SAX events with the text
content passed as characters() events.

Unless anyone objects (feel free to do so if you have better design
ideas!), I'll follow up with new patches for these two issues in the
next week or two. Once these changes are done, I think we're good to
go for the first Tika release. That timing would also be perfect for
the upcoming ApacheCon US conference. :-)

BR,

Jukka Zitting

Re: Parser roadmap

chrismattmann
Hi Jukka,

> Once TIKA-43 is committed (I'm giving it a day or two for reviews and
> comments) there are still two Parser related changes that I'd like to
> do before I think we're ready to do the first 0.1 release.

+1, agreed. So far we've worked through 30 JIRA issues (great
job guys!), and I think that the library is reaching stability and is primed
for an official release.

I'll put my name out there as someone available to be the release master
when the time comes. I've done it on Nutch before and wouldn't mind doing it
for Tika. Just let me know if you guys agree.

>
> First, I'd like to replace the current Iterable<Content> construct
> with a Metadata object that allows metadata to be passed in and out of
> the parser. Also, this Metadata object should be decoupled from parser
> configuration.

I completely agree. I'd like to help with this issue as the Metadata
framework is very near and dear to my heart. What does the interface that
you are proposing for it look like again? Something like:

String parse(InputStream stream, Metadata metadata)
             throws IOException, TikaException;


>
> Second, instead of returning the text content of a document as a
> String, I'd like the parsers to generate SAX events with the text
> content passed as characters() events.

Then, the next evolutionary step would be:

SAXEvent parse(InputStream stream, Metadata metadata)
            throws IOException, TikaException;

?

>
> Unless anyone objects (feel free to do so if you have better design
> ideas!), I'll follow up with new patches for these two issues in the
> next week or two. Once these changes are done, I think we're good to
> go for the first Tika release. That timing would also be perfect for
> the upcoming ApacheCon US conference. :-)

Totally agree! Great job so far: I am really starting to like this new
Parsing interface...

Cheers,
  Chris

>
> BR,
>
> Jukka Zitting

______________________________________________
Chris Mattmann, Ph.D.
[hidden email]
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Parser roadmap

Rida Benjelloun
Hi Jukka,
Totally agree with the parser roadmap. Thanks for the good work. I also
agree with replacing the Content class with a Metadata class. However, the
Metadata class should not be limited to one metadata standard (for example,
DublinCore); I think it should be extensible or generic enough to support
multiple metadata standards.

Regards.

On 10/5/07, Chris Mattmann <[hidden email]> wrote:

> [..snip..]


--
---------------------------------------------------------
Rida Benjelloun
Doculibre inc.
[hidden email]
[hidden email]
Cel: 418-262-3222
Tel: 418-353-3390
Site Web : http://www.doculibre.com
---------------------------------------------------------

Re: Parser roadmap

chrismattmann
Hi Rida,

[..snip..]
> However, the Metadata class should not be limited to one metadata
> standard (for example, DublinCore); I think it should be extensible or
> generic enough to support multiple metadata standards.

The current Metadata class is extensible to support any metadata standard.
The existing interfaces that it implements are meant to be helper tools that
standardize the set of MetKeys when you actually want to use standard
metadata field names; however, that doesn't preclude the use of any other
metadata key name that you'd like. In other words, it supports both:

//example 1
Metadata m = new Metadata();
m.addMetadata(DC_TITLE, "Rida");

Just the same as it supports:

//example 2
Metadata m = new Metadata();
m.addMetadata("your_field_name_here", "Rida");

If it's determined that the set of "your_field_name_here" keys makes sense
and is in widespread use throughout the code, for convenience purposes, we
could create an interface:

public interface MyKeys{
 
  public static final String YOUR_KEY_1 = "my_key_1";

  //...
}

And then have the default Metadata class implement that interface:

public class Metadata implements DublinCore...,MyKeys{
 // rest of code
}

But this isn't a requirement, and should only be done where it makes sense
to. Just wanted to clarify that.
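
For readers following along, the two usage styles above can be sketched with a small stand-in class. This is only an illustration under stated assumptions: `MetadataSketch`, `SketchMetadata`, `DublinCoreKeys`, and `getAllMetadata` are hypothetical names, not Tika's actual classes or methods, and the key strings are made up. Only the `addMetadata(key, value)` call shape and the `MyKeys` idea are taken from the examples above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataSketch {

    /** Hypothetical constants for one standard; only the key names matter. */
    interface DublinCoreKeys {
        String DC_TITLE = "title";
    }

    /** Hypothetical project-specific keys, as in the MyKeys example. */
    interface MyKeys {
        String YOUR_KEY_1 = "my_key_1";
    }

    /** Stand-in for a Metadata class: a multi-valued map of plain string keys. */
    static class SketchMetadata implements DublinCoreKeys, MyKeys {
        private final Map<String, List<String>> values = new LinkedHashMap<>();

        /** Accepts any key, whether it comes from a constants interface or not. */
        public void addMetadata(String key, String value) {
            values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }

        /** Returns all values stored under the key, or an empty list. */
        public List<String> getAllMetadata(String key) {
            return values.getOrDefault(key, Collections.<String>emptyList());
        }
    }

    public static void main(String[] args) {
        SketchMetadata m = new SketchMetadata();
        m.addMetadata(SketchMetadata.DC_TITLE, "Rida");   // standard key via a constant
        m.addMetadata("your_field_name_here", "Rida");    // arbitrary key, no interface needed
        System.out.println(m.getAllMetadata("title"));
        System.out.println(m.getAllMetadata("your_field_name_here"));
    }
}
```

The point of the sketch is that the constants interfaces add nothing to the storage model: they only give well-known keys a compile-time name, so a new metadata standard needs no change to the class itself.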

Thanks!

Cheers,
  Chris



Re: Parser roadmap

Jukka Zitting
In reply to this post by chrismattmann
Hi,

On 10/6/07, Chris Mattmann <[hidden email]> wrote:
> I'll put my name out there as someone available to be the release master
> when the time comes. I've done it on Nutch before and wouldn't mind doing it
> for Tika. Just let me know if you guys agree.

+1!

> > First, I'd like to replace the current Iterable<Content> construct
> > with a Metadata object that allows metadata to be passed in and out of
> > the parser. Also, this Metadata object should be decoupled from parser
> > configuration.
>
> I completely agree. I'd like to help with this issue as the Metadata
> framework is very near and dear to my heart. What does the interface that
> you are proposing for it look like again? Something like:
>
> String parse(InputStream stream, Metadata metadata)
>              throws IOException, TikaException;

Exactly.
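
As a rough illustration of metadata being "passed in and out" with that signature, here is a minimal sketch. It makes several assumptions that are not from the thread: a plain `Map<String, String>` stands in for the Metadata object, `TikaException` is left out so the example is self-contained, and the key names (`Content-Encoding`, `Content-Type`, `Character-Count`) are chosen only for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class MetadataInOutSketch {

    /** Hypothetical plain-text parser: reads a hint from the metadata, writes findings back. */
    static String parse(InputStream stream, Map<String, String> metadata) throws IOException {
        // Metadata in: the caller may supply an encoding hint (assumed key name).
        Charset charset = Charset.forName(metadata.getOrDefault("Content-Encoding", "UTF-8"));

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = stream.read(buffer)) != -1) {
            bytes.write(buffer, 0, n);
        }
        String text = new String(bytes.toByteArray(), charset);

        // Metadata out: the parser reports what it learned about the document.
        metadata.put("Content-Type", "text/plain");
        metadata.put("Character-Count", Integer.toString(text.length()));
        return text;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Content-Encoding", "ISO-8859-1");
        byte[] input = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        String text = parse(new ByteArrayInputStream(input), metadata);
        System.out.println(text + " " + metadata);
    }
}
```

Note that the whole document still ends up in one String here, which is the limitation the second proposed change is meant to remove.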

> > Second, instead of returning the text content of a document as a
> > String, I'd like the parsers to generate SAX events with the text
> > content passed as characters() events.
>
> Then, the next evolutionary step would be:
>
> SAXEvent parse(InputStream stream, Metadata metadata)
>             throws IOException, TikaException;

I'd rather go with:

    void parse(InputStream stream, ContentHandler handler, Metadata metadata)
        throws IOException, SAXException, TikaException;

I.e. the parser invokes a series of callback methods on the given
handler instance. This way the parse result never needs to be
contained in a single object.
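
To make the callback style concrete, here is a minimal sketch under a few assumptions: `SketchParser`, `PlainTextParser`, and `CountingHandler` are hypothetical names, a `Map<String, String>` stands in for the Metadata object, and `TikaException` is omitted so that the example depends only on the JDK's `org.xml.sax` classes. The toy parser streams its input to the handler in small chunks, and the handler processes each chunk without ever holding the full text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class CallbackParseSketch {

    /** Hypothetical mirror of the proposed signature (Map stands in for Metadata). */
    interface SketchParser {
        void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException;
    }

    /** Toy plain-text parser: streams the input to the handler in small chunks. */
    static class PlainTextParser implements SketchParser {
        @Override
        public void parse(InputStream stream, ContentHandler handler, Map<String, String> metadata)
                throws IOException, SAXException {
            metadata.put("Content-Type", "text/plain");  // metadata passed back out
            Reader reader = new InputStreamReader(stream, StandardCharsets.UTF_8);
            char[] buffer = new char[8];                  // tiny on purpose: forces several events
            handler.startDocument();
            int n;
            while ((n = reader.read(buffer)) != -1) {
                handler.characters(buffer, 0, n);         // text arrives piece by piece
            }
            handler.endDocument();
        }
    }

    /** Handler that counts events and characters without keeping the text around. */
    static class CountingHandler extends DefaultHandler {
        int events;
        int chars;

        @Override
        public void characters(char[] ch, int start, int length) {
            events++;
            chars += length;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] input = "Hello from a streamed document".getBytes(StandardCharsets.UTF_8);
        CountingHandler handler = new CountingHandler();
        Map<String, String> metadata = new HashMap<>();
        new PlainTextParser().parse(new ByteArrayInputStream(input), handler, metadata);
        System.out.println(handler.chars + " chars in " + handler.events + " events");
        System.out.println(metadata);
    }
}
```

Because the parser reuses one small buffer, memory use stays constant regardless of document size; a handler that does want the full text can still choose to accumulate it.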

BR,

Jukka Zitting

Re: Parser roadmap

Bertrand Delacretaz-2
On 10/7/07, Jukka Zitting <[hidden email]> wrote:

> On 10/6/07, Chris Mattmann <[hidden email]> wrote:
> > I'll put my name out there as someone available to be the release master
> > when the time comes....
>
> +1!

+1 for Chris as our Release Manager!

-Bertrand

Re: Parser roadmap

Bertrand Delacretaz-2
In reply to this post by Jukka Zitting
On 10/7/07, Jukka Zitting <[hidden email]> wrote:

> ...I'd rather go with:
>
>     void parse(InputStream stream, ContentHandler handler, Metadata metadata)
>         throws IOException, SAXException, TikaException;
>
> I.e. the parser invokes a series of callback methods on the given
> handler instance. This way the parse result never needs to be
> contained in a single object....

Sounds good to me!

-Bertrand

Re: Parser roadmap

Rida Benjelloun
In reply to this post by Bertrand Delacretaz-2
+1 for Chris as our Release Manager!
Rida

2007/10/10, Bertrand Delacretaz <[hidden email]>:

>
> On 10/7/07, Jukka Zitting <[hidden email]> wrote:
>
> > On 10/6/07, Chris Mattmann <[hidden email]> wrote:
> > > I'll put my name out there as someone available to be the release master
> > > when the time comes....
> >
> > +1!
>
> +1 for Chris as our Release Manager!
>
> -Bertrand
>

Re: Parser roadmap

Sami Siren-2
In reply to this post by Jukka Zitting
Jukka Zitting wrote:

> I'd rather go with:
>
>     void parse(InputStream stream, ContentHandler handler, Metadata metadata)
>         throws IOException, SAXException, TikaException;
>
> I.e. the parser invokes a series of callback methods on the given
> handler instance. This way the parse result never needs to be
> contained in a single object.

Does this mean Tika users need to implement a "parser" (ContentHandler)
that can handle the events fired by a Tika Parser? One for each format? Or
do we plan to normalize events somehow?

Or is Tika going to provide those handlers for simple tasks like
extracting title + content?


--
 Sami Siren

Re: Parser roadmap

Keith R. Bennett
In reply to this post by Rida Benjelloun
I don't know if I officially have a vote yet, but I will continue the unanimity and vote for Chris too!

- Keith

Rida Benjelloun wrote:
> +1 for Chris as our Release Manager!
> Rida

Re: Parser roadmap

robert burrell donkin-2
On 10/10/07, Keith R. Bennett <[hidden email]> wrote:
>
> I don't know if I officially have a vote yet,

everyone has a vote :-)

it's just that only some votes (PMC) are binding upon apache

- robert

Re: Parser roadmap

Jukka Zitting
In reply to this post by Sami Siren-2
Hi,

On 10/10/07, Sami Siren <[hidden email]> wrote:
> Does this mean Tika users need to implement a "parser" (ContentHandler)
> that can handle the events fired by a Tika Parser? One for each format? Or
> do we plan to normalize events somehow?

The main rationale for outputting XML is to be able to express things
like "this is a heading", "this is a link", etc. so that for example a
search engine can put more weight on those parts of the content.

My preference would be to use XHTML Basic as the XML format that the
parsers will output. XHTML is widely known and supported, and is more
than expressive enough for our needs.
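
As a sketch of how a consumer such as an indexer might use that structure, the handler below keeps heading text apart from body text so the two can be weighted differently. The assumptions are mine: the events are taken to use plain XHTML element names (`h1` to `h6`) as qualified names, `HeadingWeightSketch` and `HeadingHandler` are hypothetical names, and the JDK's own SAX parser reading an XHTML string stands in for a Tika parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class HeadingWeightSketch {

    /** Collects heading text separately from the rest, e.g. for field boosting. */
    static class HeadingHandler extends DefaultHandler {
        final StringBuilder headings = new StringBuilder();
        final StringBuilder body = new StringBuilder();
        private int headingDepth;

        private static boolean isHeading(String name) {
            return name.matches("h[1-6]");
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (isHeading(qName)) {
                headingDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (isHeading(qName)) {
                headingDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Route the text by the element context we are currently inside.
            (headingDepth > 0 ? headings : body).append(ch, start, length);
        }
    }

    /** Feeds an XHTML string through the JDK SAX parser as a stand-in event source. */
    static HeadingHandler run(String xhtml) throws Exception {
        HeadingHandler handler = new HeadingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        HeadingHandler h = run(
                "<html><body><h1>Roadmap</h1><p>Two changes remain.</p></body></html>");
        System.out.println("headings: " + h.headings);
        System.out.println("body: " + h.body);
    }
}
```

Since every parser would emit the same vocabulary, one handler like this works for all formats, which addresses the "one for each format?" concern.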

> Or is Tika going to provide those handlers for simple tasks like
> extracting title + content?

I would at least have utility adapters that convert the SAX events to
a character stream and further to a single string.
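
Such adapters could look roughly like the following. This is a sketch, not a committed design: `WriterHandler` and `StringHandler` are hypothetical names, and the events are fired by hand in `main` instead of coming from a real parser.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class TextAdapterSketch {

    /** Adapter: forwards characters() events to a Writer (a character stream). */
    static class WriterHandler extends DefaultHandler {
        private final Writer writer;

        WriterHandler(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                writer.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException("Failed to write text content", e);
            }
        }

        @Override
        public void endDocument() throws SAXException {
            try {
                writer.flush();
            } catch (IOException e) {
                throw new SAXException("Failed to flush text content", e);
            }
        }
    }

    /** Adapter built on the first one: buffers the stream into a single String. */
    static class StringHandler extends WriterHandler {
        private final StringWriter buffer;

        StringHandler() {
            this(new StringWriter());
        }

        private StringHandler(StringWriter buffer) {
            super(buffer);
            this.buffer = buffer;
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        StringHandler handler = new StringHandler();
        char[] first = "Parser ".toCharArray();
        char[] second = "roadmap".toCharArray();
        handler.startDocument();
        handler.characters(first, 0, first.length);
        handler.characters(second, 0, second.length);
        handler.endDocument();
        System.out.println(handler);
    }
}
```

Callers that only want the old String result would pass a `StringHandler` and call `toString()` afterwards, while callers with large documents could point a `WriterHandler` at a file or socket.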

BR,

Jukka Zitting