external parsers

external parsers

Philipp Koch
hi jukka,
you wrote in a recent post:
>We also need to work on growing the community and figuring out how to
>best interact with external parser projects.
i am currently also doing meta data extraction from various file
formats and got also attracted by the introduction of the tika
project. i found a very interesting image meta data extractor library
which is shipped under apache license but the project itself is not
hosted at apache (see http://www.fightingquaker.com/sanselan/). would
it make sense to ask the project owner(s) of such projects to move to
the apache project, to also make sure that such useful libs will be
maintained and development will continue?

regards, philipp

ps: don't know if this is the right place for such questions....

Re: external parsers

Jukka Zitting
Hi,

On 6/13/07, Philipp Koch <[hidden email]> wrote:
> i am currently also doing meta data extraction from various file
> formats and got also attracted by the introduction of the tika
> project. i found a very interesting image meta data extractor library
> which is shipped under apache license but the project itself is not
> hosted at apache (see http://www.fightingquaker.com/sanselan/).

Looks nice!

> would it make sense to ask the project owner(s) of such projects to
> move to the apache project, to also make sure that such useful libs
> will be maintained and development will continue?

It's up to the external project community to decide if they want to
become an Apache project. We can of course mention the Incubator and
offer to help if they want to bring the project to Apache, but I
wouldn't want to go on a crusade to turn all our dependencies into
Apache projects.

I think the main criteria for selecting which external libraries to
use as default parsers in Tika (a plugin interface should of course
allow any other libraries to be used instead of the defaults if
needed) would be code quality, licensing, and active maintenance. All
of these are typically well handled by Apache projects, but there's no
inherent reason why external projects couldn't meet these criteria
just as well as, or even better than, Apache projects.

So, once we have our act together (a working codebase and an
architectural roadmap) I think we should start contacting various
parser projects for cooperation. We should explain what we are trying
to do and, for each parser library we depend on, preferably have
someone who follows the mailing lists for both Tika and the parser
library in question. While building those bridges we could also
mention the possibility of bringing external projects into Apache, but
that definitely shouldn't be a precondition for cooperation.

> ps: don't know if this is the right place for such questions....

Good as any. :-)

BR,

Jukka Zitting