Towards Tika 0.1 (Was: Re: [jira] Commented: (TIKA-7) Lius Lite remove all lucene dependencies from Lius and use Nutch office parsers)

Jukka Zitting
Hi,

[Taking the discussion to the mailing list.]

On 6/13/07, Chris A. Mattmann (JIRA) wrote:
>  One question I have is, have we standardized on the following issues
> (I know they were discussed at ApacheCon at the BoF, as I've seen
> conversation on the dev list regarding it, however, I wasn't there :) ):

We have no firm decisions yet. Everything is open to discussion. :-)

> 1. standardization of Parser interface?
> 2. control flow of Tika parsers

I think both are major open issues to which we don't have a clear
answer yet. There seems to be a shared understanding that we should be
able to come up with a generic architecture that is superior (in terms
of flexibility and extensibility) to any of the existing solutions.
But to get there I think we need to do at least a few design
iterations.

> 3. major features that we want for 0.1 release

I think the rough consensus during ApacheCon EU was that we should go
for a quick (and dirty :-) first release based on the existing
codebases we have. We should label it as a "technology preview" and
make it clear that any or all of the interfaces can and will change in
future releases.

The benefit of doing such an early release is that it would give
people something to play with, and even as-is it would already provide
a generic parsing toolkit that doesn't really exist at the moment.

> 1. I like Bertrand's idea of a pipeline-based Tika framework. I think that
> the "ContentFilter" that he proposes is essentially this Parser interface
> that we are talking about. Immediate questions that come to mind are:

Let's move this to Bertrand's pipeline thread.

> 3. I think that we should plan to have the following features in the 0.1
> release of Tika:
>    a. Basic parsing capability, +1 for using pipelining, but we need to
>        standardize the interfaces for those/talk about architecture
>    b. Content Type identification (e.g., MimeType identification)
>    c. Basic metadata extraction capabilities
>    d. Parsing for a limited set of known content types, e.g., HTML and PDF

Agreed. I think we already have all that functionality in various
pieces, so I would suggest that we first try to merge our existing
code into a somewhat coherent whole (that would then become a baseline
0.1 release), before proceeding with major changes like introducing a
pipeline model.

BR,

Jukka Zitting