Pushing parsers upstream

Jukka Zitting
Hi,

As you know, we see a lot of questions about version mismatches (which
POI or PDFBox version should go with this Tika version) and there's a
long queue of patches that are waiting for new official releases of
our upstream dependencies to become available.

To avoid this issue I propose that we start moving some of our parser
implementations to upstream projects. Now with Tika 1.0 out we have a
stable Parser and Detector interfaces and related APIs that upstream
libraries could implement directly without us having to worry about
changing Tika code whenever a new version of a parser library becomes
available.

This would allow our users to for example directly upgrade to a new
POI version without waiting for a related Tika release first.
Similarly, a new PDF parsing option or improvement could be
implemented directly in PDFBox and be usable without any code changes
in Tika.

The classloading and OSGi service mechanisms we've added should make
such upstream Parser implementations trivially easy to use, and we
could still keep the dependencies in tika-parsers as a way to pull in
the libraries even if the relevant implementation classes would no
longer reside in org.apache.tika.parsers.*.
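The service-loading mechanism mentioned above can be sketched in a few lines. This is a minimal, self-contained illustration only: the `Parser` interface and the upstream class below are simplified stand-ins, not the real org.apache.tika classes. In practice Tika reads provider class names from a `META-INF/services/org.apache.tika.parser.Parser` file on the classpath and instantiates them reflectively, which is what the loader here simulates.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for Tika's Parser SPI (illustrative, not the real interface).
interface Parser { String parse(String content); }

// An "upstream" parser implementation, e.g. one that could ship inside PDFBox.
class UpstreamPdfParser implements Parser {
    public String parse(String content) { return "<xhtml>" + content + "</xhtml>"; }
}

public class ServiceLoadingDemo {
    // Tika discovers parsers by reading class names from a provider file on
    // the classpath; here we simulate that lookup with reflection over a list.
    static List<Parser> loadParsers(List<String> providerClassNames) throws Exception {
        List<Parser> parsers = new ArrayList<>();
        for (String name : providerClassNames) {
            parsers.add((Parser) Class.forName(name).getDeclaredConstructor().newInstance());
        }
        return parsers;
    }

    public static void main(String[] args) throws Exception {
        // The upstream jar only needs to list its class in the provider file;
        // Tika's client code never references the implementation by name.
        List<Parser> parsers = loadParsers(List.of("UpstreamPdfParser"));
        System.out.println(parsers.get(0).parse("hello"));
    }
}
```

Because discovery happens by classpath lookup rather than compile-time reference, swapping in a newer upstream jar changes which parser is found without any Tika code change.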

In addition to some of the GPL libraries for which we've already done
this, I recently took the liberty of trying this out also with PDFBox.
See PDFBOX-1132 [1] for the issue where I copied the
org.apache.tika.pdf implementation to org.apache.pdfbox.tika. It works
without problems, so now I'd like to propose that we copy any more
recent PDF parser changes to PDFBox and prepare to drop the parser
implementation in tika-parsers. Any further PDF parser work should
then be done directly in PDFBox. I haven't yet talked about this with
the PDFBox PMC (of which I'm a member), but I suppose we should be
able to come up with an arrangement where Tika committers can commit
directly to the Tika parser implementation in PDFBox.

It would be cool if we could do the same thing also with POI.

WDYT?

[1] https://issues.apache.org/jira/browse/PDFBOX-1132

BR,

Jukka Zitting

Re: Pushing parsers upstream

Nick Burch-4
On Tue, 13 Dec 2011, Jukka Zitting wrote:
> To avoid this issue I propose that we start moving some of our parser
> implementations to upstream projects. Now with Tika 1.0 out we have a
> stable Parser and Detector interfaces and related APIs that upstream
> libraries could implement directly without us having to worry about
> changing Tika code whenever a new version of a parser library becomes
> available.

A couple of issues do spring to mind with this plan:
* Metadata keys - if a parser enhancement or new feature needs a new
   metadata key, then you end up having to wait for a new Tika release to
   get it (before you can release the code that uses it)
* Consistency - both our markup and metadata keys will be harder to
   keep consistent when they aren't in the same codebase

For detectors, there's an extra issue here. At the moment, both the Zip and
OLE2 detectors handle more than just the POI formats, and in the Zip case
rely on code shared between the parsers (poi+keynote) and the detector. How
would this work if the container detectors were handed to POI? And whose
job would it be to test it?

That's a general thing actually, how much testing would need to remain on
the Tika side?

Oh, but I guess this counts as your answer on what I should be doing with
my Ogg Vorbis parser :)

Nick

Re: Pushing parsers upstream

Antoni Mylka-2
On 2011-12-13 12:23, Nick Burch wrote:

> On Tue, 13 Dec 2011, Jukka Zitting wrote:
>> To avoid this issue I propose that we start moving some of our parser
>> implementations to upstream projects. Now with Tika 1.0 out we have a
>> stable Parser and Detector interfaces and related APIs that upstream
>> libraries could implement directly without us having to worry about
>> changing Tika code whenever a new version of a parser library becomes
>> available.
>
> A couple of issues do spring to mind with this plan:
> * Metadata keys - if a parser enhancement or new feature needs a new
> metadata key, then you end up having to wait for a new Tika release to
> get it (before you can release the code that uses it)

What's wrong with using plain strings in upstream parsers, until
appropriate constants in TikaMetadataKeys become available?
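The plain-string point can be made concrete. The sketch below uses a stand-in `Metadata` class (not the real org.apache.tika.metadata one) and an illustrative key name: a constant added to tika-core later is just a name for the same string, so upstream code written against the literal keeps working once the constant appears.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for org.apache.tika.metadata.Metadata (illustrative only).
class Metadata {
    private final Map<String, String> data = new HashMap<>();
    void set(String name, String value) { data.put(name, value); }
    String get(String name) { return data.get(name); }
}

public class StringKeyDemo {
    // What tika-core might later expose as a shared constant.
    static final String SAMPLE_RATE = "xmpDM:audioSampleRate";

    public static void main(String[] args) {
        Metadata fromUpstream = new Metadata();
        // An upstream parser can use the raw string today...
        fromUpstream.set("xmpDM:audioSampleRate", "44100");
        // ...and clients reading via the later-added constant still find the
        // value, because the constant resolves to the very same string.
        System.out.println(fromUpstream.get(SAMPLE_RATE));
    }
}
```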

> * Consistency - both our markup and metadata keys will be harder to
> keep consistent when they aren't in the same codebase

Probably, though the benefits are huge.

> For detectors, there's an extra issue here. At the moment, both the Zip and
> OLE2 detectors handle more than just the POI formats, and in the Zip
> case rely on code shared between the parsers (poi+keynote) and the detector.
> How would this work if the container detectors were handed to POI? And
> whose job would it be to test it?

The same people who now rely on it - the community, helped by a detailed
test suite.

> That's a general thing actually, how much testing would need to remain
> on the Tika side?

Dunno. There is no official policy in this regard, is there? ASF makes
guarantees that a release is OK from the legal POV, has reasonable
(released, available, proper license) dependencies and that the unit
tests pass. Regression testing is done on a "best effort" basis anyway
and from my POV there is no difference in effort whether the detectors
are in POI or in Tika. Is there any?

In Aperture we sidestepped this problem by pushing non-released versions
of POI, PDFBox and other libraries to our own repository and depending
on them. Sometimes these were "vanilla" trunks, sometimes trunks with my
patches. See for instance

http://aperture.sourceforge.net/maven/org/apache/poi/poi/

This worked well enough for an "internal" project, but didn't work too
well for an open source one. It also took a lot of work.

With Tika such a solution is impossible for a number of reasons and
pushing parsers upstream sounds like a great alternative:
  * a way to allow for such cherry-picking of dependency trunks to take
place in-house, when need arises, without the need to do it in public.
  * a way to ensure "graceful degradation" of Tika functionality when
the libraries are missing, without ugly ClassNotFoundErrors. (probably
the only reliable way).

I'm all for it.

Antoni Mylka
[hidden email]

Re: Pushing parsers upstream

Mattmann, Chris A (3010)
In reply to this post by Jukka Zitting
Hey Jukka,

For places like POI and PDFBox I think this could definitely work. And then for
places where we have Parsers, but aren't ready to push upstream yet (I can
think of two examples of this relevant to me, NetCDF/HDF and GDAL),
we can just leave the Parser in tika-parsers I think.

In this manner, what you're really suggesting is that it would be great for
our mature Parsers to be "promoted" upstream to the communities that
really understand the underlying Parser implementation toolkit. I think
this makes sense to me, so long as there is a Champion or someone in
that community willing to spend the small amount of time to learn Tika
and its interfaces (if they haven't done so already).

The net effect to the casual Tika user is nil, since we have Parser loading via
service factories, and the only thing that'll change there is the package
name (and potentially the class name) but it's all behind the scenes.
The net effect to the Tika developer is that the class and package name
changes may cause folks to have to recompile code/etc., and the
code/unit tests/maintenance of some of the parsers would no longer
be readily available in Tika's tika-parsers artifact, but would live
in the tika-parser dependency library upstream.

Cheers,
Chris

On Dec 13, 2011, at 1:42 AM, Jukka Zitting wrote:


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [hidden email]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Pushing parsers upstream

Michael McCandless-2
+0

I agree, logically, parsers "belong" with their upstream project, since
as that project improves how the document format is cracked, they can
also make the matching fixes to Tika's parser. As long as there's
enough love / advocacy / testing for the Tika parser in that project...

My only concern is the possible added latency in getting parser-only
fixes out to Tika's users.

Ie, once a parser is upstream, if there's a fix that would only require
a change to the parser's source code (say we open up control over
another PDFBox option, or work around an issue in PDFBox), PDFBox must
fix it, then release, then Tika must upgrade, then Tika must release.
It's true users could directly upgrade their PDFBox w/o waiting for a
Tika release, but I suspect most users don't do that...

Vs today, where we just fix & release Tika directly.

Would it somehow be possible for Tika to ship an unreleased PDFBox? Or
does Maven fully tie our hands here?
Mike McCandless

http://blog.mikemccandless.com

On Tue, Dec 13, 2011 at 10:16 AM, Mattmann, Chris A (388J)
<[hidden email]> wrote:


Re: Pushing parsers upstream

Antoni Mylka-2
On 2011-12-13 18:05, Michael McCandless wrote:
> Would it somehow be possible for Tika to ship an unreleased PDFBox? Or
> does Maven fully tie our hands here?

That's the issue. Would it? AFAIU it's impossible. Tika can only depend
on jars in Maven Central. Is it possible to push a snapshot jar to Maven
Central (and label it with a version number which includes the date or
something)? There are such jars, but how does it look in practice? Who
decides if a jar can or cannot be uploaded?

Antoni Mylka
[hidden email]

Re: Pushing parsers upstream

Jukka Zitting
In reply to this post by Nick Burch-4
Hi,

On Tue, Dec 13, 2011 at 12:23 PM, Nick Burch <[hidden email]> wrote:
> A couple of issues do spring to mind with this plan:

Good points.

> * Metadata keys - if a parser enhancement or new feature needs a new
>  metadata key, then you end up having to wait for a new Tika release to
>  get it (before you can release the code that uses it)

As mentioned by Antoni, in the end the metadata keys are just strings,
so with a little coordination we don't need to delay the introduction
of new keys over multiple releases.

More generally though, I think it would make sense over time to have
tika-core maintain a shared set of metadata keys (Dublin Core, xmpDM,
etc.) that aren't directly tied to any specific parser or file format.
Format-specific keys like the ones we now have in the MSOffice
interface would be better kept next to the actual parser
implementation. That way, as long as the generic metadata keys in
tika-core are more or less complete (i.e. cover all of the key
metadata standards), there should be little need for changes in the
rest of Tika when a parser implementation wants to introduce a new
custom metadata key.

> * Consistency - both our markup and metadata keys will be harder to
>  keep consistent when they aren't in the same codebase

Yep, that can be a problem. I guess the ultimate solution to this
would be to come up with a well documented definition of what a parser
should ideally output for specific kinds of content, but that's quite
a bit of work.

A partial solution could be the kind of shared committership model I
was proposing. Then a single committer who wants to increase the level
of consistency should be able to do so without worrying about karma
boundaries.

> For detectors, there's an extra issue here. At the moment, both the Zip and
> OLE2 detectors handle more than just the POI formats, and in the Zip case
> rely on code shared between the parsers (poi+keynote) and the detector. How
> would this work if the container detectors were handed to POI?

I guess this would require some level of code duplication, i.e. having
a Zip detector in POI that knows about OOXML types, and another in
tika-parsers that knows about other types of Zips.

> And whose job would it be to test it? That's a general thing actually, how
> much testing would need to remain on the Tika side?

I'd still have the upstream libraries as dependencies of tika-parsers,
and we definitely should continue maintaining a good set of
integration tests there. On the other hand we already have many tests
that actually test against issues in upstream parser libraries instead
of any code in Tika, and I think those tests would be better located
in the upstream projects. Ultimately test cases should go with the
issues where particular problems or wishes were expressed.

> Oh, but I guess this counts as your answer on what I should be doing with my
> Ogg Vorbis parser :)

:-) Yep, in a way.

From the beginning the idea behind Tika has been that we should focus
on being a thin integration layer on top of existing parser libraries.
The fact that we're now implementing quite a few parsers ourselves,
and the large amount of code we use to wrap especially POI and to a
lesser degree PDFBox, is a bit of a concern to me. We could and should
be pushing more of this work to places where it would be useful also
to people who aren't using Tika.

There are many people who'd likely benefit from for example a good RTF
or Ogg Vorbis parser but who don't really need Tika. Being able to get
such people to use and contribute to the code we've written would
indirectly help also Tika. Attracting such users and contributions is
hard if the code lives only inside Tika.

Similarly many bits and pieces in especially our bigger parser classes
like those for POI and PDFBox would be useful also within the context
of the upstream libraries. For example I could easily see the
character run handling code in WordExtractor, the sparse sheet
capturing and rendering code in ExcelExtractor, or the annotation
handling code in PDF2XHTML becoming a more generally applicable part
of the upstream libraries.

So while having all this code in Tika makes it easy for us to maintain
consistency and rapid evolution in Tika, it introduces a barrier to
making the work we do useful also to a wider audience, and thus
ultimately reduces the rate of useful contributions we can expect.

During Tika 0.x I think the tradeoff favored focusing our work on Tika
itself, but now with stable 1.0 APIs I think the time may be ripe to
start reducing the size of tika-parsers (which has been growing quite
a bit, see [1]).

[1] https://www.ohloh.net/p/tika/analyses/latest

BR,

Jukka Zitting

Re: Pushing parsers upstream

Jukka Zitting
In reply to this post by Michael McCandless-2
Hi,

On Tue, Dec 13, 2011 at 6:05 PM, Michael McCandless
<[hidden email]> wrote:
> It's true users could directly upgrade their PDFBox w/o waiting for a
> Tika release, but I suspect most users don't do that...

Currently people don't do that because it's so easy to break things by
upgrading a parser library out of sync with Tika. We've even been actively
discouraging people from selectively upgrading parser libraries to
avoid such problems.

With my proposal this problem would no longer apply, and we could
actually start proactively instructing people that they can and should
try upgrading the relevant parser libraries if they face problems with
a particular document.

BR,

Jukka Zitting

Re: Pushing parsers upstream

Antoni Mylka-2
In reply to this post by Jukka Zitting
On 2011-12-16 16:12, Jukka Zitting wrote:

>> And who's job would it be to test it? That's a general thing actually, how
>> much testing would need to remain on the Tika side?
>
> I'd still have the upstream libraries as dependencies of tika-parsers,
> and we definitely should continue maintaining a good set of
> integration tests there. On the other hand we already have many tests
> that actually test against issues in upstream parser libraries instead
> of any code in Tika, and I think those tests would be better located
> in the upstream projects. Ultimately test cases should go with the
> issues where particular problems or wishes were expressed.

The moment upstream libraries start depending on tika-core, they stop
being upstream libraries and become "side-stream" libraries. Putting POI
between core and parsers in the dependency chain will bring all sorts of
issues due to independent release cycles.

Therefore I think we should drop the dependency from tika-parsers to
POI, maintain integration tests in some other (new) maven module below
parsers, poi and pdfbox, and expose some tika-integration pom, which
will depend on tika-core, tika-parsers, and the latest-and-greatest
versions of poi and pdfbox compatible with the given core version. The
tika-integration pom could be updated after each release of an external
parser. All tutorials could then point out that you need to add one
dependency to your pom and that's tika-integration, e.g. with scope
"import".
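A sketch of what such a tika-integration POM might look like (the coordinates and versions below are purely illustrative assumptions, not real artifacts):

```xml
<!-- Hypothetical tika-integration aggregator POM (illustrative only). -->
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-integration</artifactId>
  <version>1.0</version>
  <packaging>pom</packaging>
  <dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-core</artifactId>
      <version>1.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.0</version>
    </dependency>
    <!-- latest-and-greatest upstream parsers known to work with this core -->
    <dependency>
      <groupId>org.apache.pdfbox</groupId>
      <artifactId>pdfbox</artifactId>
      <version>1.6.0</version>
    </dependency>
  </dependencies>
</project>
```

Importing this one POM would pin a tested combination, and updating it after each upstream release would keep the combination current without touching tika-core or tika-parsers.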

In general pushing parsers "upstream" brings:
  - graceful degradation with missing dependencies
  - ability to use a later pdfbox without updating tika
  - "social" benefits of putting that code closer to people who'll
    know most about how to make it work

But:
  - the contract between core and parsers will have to be super-rigid.
    Now we can allow ourselves to say that core and parser jars must
    be of the same version. With upstream parsers, it will be more
    difficult. This applies to utils, common abstract classes etc.
    we'll need to look out for two cases
    - new pdfbox will not work with old versions of tika
    - when we release new tika version, old pdfbox may not work
      with it until the next release (assumes that tika-parsers
      don't depend on pdfbox, because then we're in trouble)
  - we'd have to add a bit more complexity to the module setup

I still feel it's worth it though.
WDYT?

Antoni Mylka
[hidden email]

Re: Pushing parsers upstream

Antoni Mylka-2
In reply to this post by Jukka Zitting
On 2011-12-16 16:12, Jukka Zitting wrote:
>> * Consistency - both our markup and metadata keys will be harder to
>>   keep consistent when they aren't in the same codebase
>
> Yep, that can be a problem. I guess the ultimate solution to this
> would be to come up with a well documented definition of what a parser
> should ideally output for specific kinds of content, but that's quite
> a bit of work.

There are (at least) two efforts to create a "well documented definition
of what a parser should ideally output for specific kinds of content".
One is shared-desktop-ontologies, spearheaded by Sebastian Trueg from
KDE (disclaimer: I was involved in early stages of this in 2007-2008).
It lives at oscaf.sf.net. The second is XMP.

I don't want to start new flame wars, and I understand that the current
status quo is probably the best possible given all requirements, yet
let's not get carried away about creating yet another ultimate solution.

Antoni Mylka
[hidden email]

Re: Pushing parsers upstream

Jukka Zitting
In reply to this post by Antoni Mylka-2
Hi,

On Fri, Dec 16, 2011 at 7:45 PM, Antoni Mylka <[hidden email]> wrote:
> The moment upstream libraries start depending on tika-core, they stop being
> upstream libraries and become "side-stream" libraries. Putting POI between
> core and parsers in the dependency chain will bring all sorts of issues due
> to independent release cycles.

What issues? As long as we maintain proper backwards compatibility in
tika-core (we already have clirr configuration to automatically verify
this), there should be no problems with independent release cycles.

> Therefore I think we should drop the dependency from tika-parsers to POI,
> maintain integration tests in some other (new) maven module below parsers,
> poi and pdfbox, and expose some tika-integration pom, which will depend on
> tika-core, tika-parsers, and the latest-and-greatest versions of poi and
> pdfbox compatible with the given core version.

The tika-parsers component can already be used like this. The setup
I'm proposing has upstream parsers depending on tika-core, not
tika-parsers.

>  - the contract between core and parsers will have to be super-rigid.
>   Now we can allow ourselves to say that core and parser jars must
>   be of the same version. With upstream parsers, it will be more
>   difficult. This applies to utils, common abstract classes etc.
>   we'll need to look out for two cases
>   - new pdfbox will not work with old versions of tika

It will, as long as it's written against the 1.0 release instead of a
more recent 1.x version. If pdfbox explicitly needs a more recent Tika
version, then obviously it won't work with an older release, but such
cases should be fairly rare and clearly documented in the relevant
release notes or POM dependency settings.

>   - when we release new tika version, old pdfbox may not work
>     with it until the next release

We're explicitly committed to maintaining backwards compatibility (see
https://issues.apache.org/jira/browse/TIKA-699) until Tika 2.0, so any
case where a new Tika release breaks an existing upstream parser
should be treated as a bug and fixed.

BR,

Jukka Zitting

Re: Pushing parsers upstream

Jukka Zitting
In reply to this post by Antoni Mylka-2
Hi,

On Fri, Dec 16, 2011 at 8:04 PM, Antoni Mylka <[hidden email]> wrote:
> I don't want to start new flame wars, and I understand that the current
> status quo is probably the best possible given all requirements, yet let's
> not get carried away about creating yet another ultimate solution.

I was just thinking of stuff like that a parser should preferably use
XMP schemas when exposing metadata, not about inventing our own
schemas.

BR,

Jukka Zitting

Re: Pushing parsers upstream

Antoni Mylka-2
In reply to this post by Jukka Zitting
On 2011-12-16 20:32, Jukka Zitting wrote:

> Hi,
>
> On Fri, Dec 16, 2011 at 7:45 PM, Antoni Mylka <[hidden email]> wrote:
>> The moment upstream libraries start depending on tika-core, they stop being
>> upstream libraries and become "side-stream" libraries. Putting POI between
>> core and parsers in the dependency chain will bring all sorts of issues due
>> to independent release cycles.
>
> What issues? As long as we maintain proper backwards compatibility in
> tika-core (we already have clirr configuration to automatically verify
> this) there should be no problems with independent release cycles.

Dunno, maybe I'm overreacting. I had two issues in mind:

1. Incompatible changes in core which require adjustment of parsers. An
API vs. SPI question, where the user-level API is set in stone, while
the service implementor-level SPI is more flexible. Right now such
tricks are possible; with POI outside parsers they would still be
possible; with POI between core and parsers they would be effectively
impossible, as each one would introduce a release deadlock. Since we
are committed to compatibility from all sides and make no distinction
between API and SPI policies, it's impossible anyway, so this is a
non-issue.

2. Exposing core-level, parser-related improvements to the general
public. Right now each parser may or may not implement support for
EmbeddedDocumentExtractor, or for DocumentSelector. I can imagine
expanding parsers with support for additional hooks like these, for
instance a password list for all parsers to try before giving up on an
encrypted document (doc, docx, pdf, zip etc.).
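The hook mechanism being described is Tika's ParseContext pattern: a type-keyed bag of optional callbacks that parsers probe at runtime. A minimal self-contained sketch follows (the classes are stand-ins, not the real org.apache.tika.parser ones, and the password hook itself is hypothetical, as in the mail):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical hook: a list of passwords to try on encrypted documents.
interface PasswordProvider { List<String> getPasswords(); }

// Minimal stand-in for org.apache.tika.parser.ParseContext: a map from
// hook type to hook instance, queried by class literal.
class ParseContext {
    private final Map<Class<?>, Object> hooks = new HashMap<>();
    <T> void set(Class<T> type, T hook) { hooks.put(type, hook); }
    <T> T get(Class<T> type) { return type.cast(hooks.get(type)); }
}

public class HookDemo {
    // A parser probes the context for the optional hook and degrades
    // gracefully when it is absent -- older parsers simply ignore new hooks.
    static String tryDecrypt(ParseContext context) {
        PasswordProvider provider = context.get(PasswordProvider.class);
        if (provider == null) return "no passwords available";
        return "trying " + provider.getPasswords().size() + " password(s)";
    }

    public static void main(String[] args) {
        ParseContext context = new ParseContext();
        System.out.println(tryDecrypt(context));
        context.set(PasswordProvider.class, () -> List.of("secret", "hunter2"));
        System.out.println(tryDecrypt(context));
    }
}
```

Because the hook is looked up dynamically rather than declared in the Parser interface, adding a new hook type to core doesn't break parsers compiled against an older core.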

With the scenario you're proposing, exposing such functionality to the
general public will require two tika releases, not one:
  1. release tika with that hook
  2. release pdfbox with parser making use of that hook
  3. release tika with new pdfbox

With pdfbox outside parsers, step 3 wouldn't be necessary, but on
second thought the user will still be able to exclude the bundled
version of pdfbox and use a new one in their app, with exactly the same
effect. Moreover such cases are likely to be rare.

So I guess I've refuted my own arguments :).

Antoni Mylka
[hidden email]

Re: Pushing parsers upstream

Nick Burch-4
In reply to this post by Jukka Zitting
On 16/12/11 15:12, Jukka Zitting wrote:
> As mentioned by Antoni, in the end the metadata keys are just strings,
> so with a little coordination we don't need to delay the introduction
> of new keys over multiple releases.

Hmm, they're not quite just strings - with the new Property stuff they
can also have validation. I think, however, that having a parser
temporarily include its own copy of a definition shouldn't be the end
of the world.
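For illustration, a property-with-validation might look like the minimal stand-in below (this is not the real org.apache.tika.metadata.Property API, just a sketch of the closed-choice idea it supports):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Minimal stand-in for the idea behind Tika's Property: a metadata key
// that also carries validation (illustrative only).
class Property {
    final String name;
    private final Set<String> allowed;  // null means any value is accepted

    private Property(String name, Set<String> allowed) {
        this.name = name;
        this.allowed = allowed;
    }

    // A key restricted to a fixed set of legal values.
    static Property closedChoice(String name, String... choices) {
        return new Property(name, new HashSet<>(Arrays.asList(choices)));
    }

    boolean isValid(String value) {
        return allowed == null || allowed.contains(value);
    }
}

public class PropertyDemo {
    public static void main(String[] args) {
        // A parser shipping its own temporary copy of such a definition keeps
        // the validation behaviour even before tika-core adds the constant.
        Property channels = Property.closedChoice("xmpDM:audioChannelType", "Mono", "Stereo");
        System.out.println(channels.isValid("Stereo"));
        System.out.println(channels.isValid("Quad"));
    }
}
```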

> More generally though, I think it would make sense over time to have
> tika-core maintain a shared set of metadata keys (Dublin Core, xmpDM,
> etc.) that aren't directly tied to any specific parser or file format.
> Format-specific keys like the ones we now have in the MSOffice
> interface

Ah, that MSOffice one is now badly named - lots of the other parsers
make use of keys that it provides. We should maybe rename it to
something more general, to indicate that it relates to most
productivity document formats.

In general though, I agree that re-using an existing defined key name
(eg xmp where it covers it) makes sense. At the very least, it avoids
work trying to come up with a name, and you get the documentation for
the entry for free :)

> That way, as long as the generic metadata keys in
> tika-core are more or less complete (i.e. cover all of the key
> metadata standards), there should be little need for a parser
> implementation to need changes in the rest of Tika if it wants to
> introduce a new custom metadata key.

I think we're not quite there yet though, so for at least the next year
(at a guess) we're going to need to be adding new keys and
rationalising existing ones.

>> * Consistency - both our markup and metadata keys will be harder to
>>   keep consistent when they aren't in the same codebase
>
> Yep, that can be a problem. I guess the ultimate solution to this
> would be to come up with a well documented definition of what a parser
> should ideally output for specific kinds of content, but that's quite
> a bit of work.

Possibly we could use some tooling to identify the differences, then
have a periodic check to ensure things haven't got worse. My hunch is
that this shouldn't be too hard to set up, but I'm not volunteering to
do it...!

>> For detectors, there's an extra issue here. At the moment, both the Zip and
>> OLE2 detectors handle more than just the POI formats, and in the Zip case
>> rely on code shared between the parsers (poi+keynote) and the detector. How
>> would this work if the container detectors were handed to POI?
>
> I guess this would require some level of code duplication, i.e. having
> a Zip detector in POI that knows about OOXML types, and another in
> tika-parsers that knows about other types of Zips.

Hmm, I'd rather we didn't have too much duplication. I think this might
end up with quite a bit, and would need quite a lot of testing to ensure
things worked well. Potentially we could end up with something like 5
Zip based detectors in that model, such as:
* OOXML one, in POI (needs POI bits)
* iWorks one, in future iWorks library (needs iWorks parser bits)
* ODF one, in ODFToolkit (needs ODF bits)
* Core Tika one (zip, jar, war etc)

At that point maybe we need a zip detector plugin model...

(The OLE2 case is fine - because the detector is powered by POIFS,
non-POI-supported OLE2 formats are probably best detected by code within POI)
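The zip detector plugin model mooted above could be sketched like this. The types and class names are invented stand-ins (not real Tika/POI classes); the `[Content_Types].xml` entry really is how OOXML zips are recognised, though real detection would then read that entry to distinguish docx from xlsx and so on.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Plugin contract: each library contributes a detector that inspects
// zip entry names and either claims the container or passes.
interface ZipDetector { Optional<String> detect(Set<String> entryNames); }

// Contributed by POI: claims OOXML containers.
class OoxmlDetector implements ZipDetector {
    public Optional<String> detect(Set<String> names) {
        return names.contains("[Content_Types].xml")
            ? Optional.of("application/vnd.openxmlformats-officedocument.wordprocessingml.document")
            : Optional.empty();
    }
}

// Tika-side composite: tries each plugin in turn, falls back to plain zip.
class CompositeZipDetector implements ZipDetector {
    private final List<ZipDetector> plugins;
    CompositeZipDetector(List<ZipDetector> plugins) { this.plugins = plugins; }
    public Optional<String> detect(Set<String> names) {
        for (ZipDetector d : plugins) {
            Optional<String> type = d.detect(names);
            if (type.isPresent()) return type;
        }
        return Optional.of("application/zip");  // generic fallback
    }
}

public class ZipPluginDemo {
    public static void main(String[] args) {
        ZipDetector detector = new CompositeZipDetector(List.of(new OoxmlDetector()));
        System.out.println(detector.detect(Set.of("[Content_Types].xml", "word/document.xml")).get());
        System.out.println(detector.detect(Set.of("README.txt")).get());
    }
}
```

Each upstream project would only need to expose its plugin; Tika would discover them the same way it discovers parsers, avoiding the duplicated zip-walking code discussed above.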


Nick