Support for document libraries

Support for document libraries

Carsten Ziegeler
AFAIK there is currently no central place at Apache where
libraries/frameworks for handling specific document formats are
developed. We have individual projects like POI, of course.

If you are searching for Java libraries that support a specific format,
like some image formats, you'll find many libraries of varying quality,
and it's really hard (if not impossible) to choose the right one.

I'm wondering if something could be done about it by starting a project
at Apache which supports various file formats (like images, mp3 etc.) -
perhaps by incubating some existing stuff.

Although Tika is more the framework for plugging in such stuff, it perhaps
makes sense to try to start something like that as subprojects of Tika?

WDYT?

Carsten
--
Carsten Ziegeler
[hidden email]



Re: Support for document libraries

Bertrand Delacretaz
On 7/10/07, Carsten Ziegeler <[hidden email]> wrote:

>... Although Tika is more the framework for plugging in such stuff, it perhaps
> makes sense to try to start something like that as subprojects of Tika?...

I would agree, although IMHO Tika should reuse existing libraries as
much as possible.

In some cases, the Tika part could just consist of automated tests for
existing libraries, to help in selecting and validating them.
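
To illustrate the idea of using automated tests to select among existing libraries, here is a minimal sketch in Java. Everything in it is hypothetical: `CandidateLibrary`, `Sample`, `LibraryComparison`, and the three stand-in candidates are invented for illustration and are not part of Tika or of any real library. The sketch runs each candidate over a shared set of sample documents and counts how many it handles correctly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper around a third-party library under evaluation.
interface CandidateLibrary {
    String name();
    // Returns the extracted text; may throw if the input cannot be parsed.
    String extractText(byte[] document) throws Exception;
}

// One sample document plus the text a correct library should extract from it.
final class Sample {
    final byte[] document;
    final String expectedText;

    Sample(String raw, String expectedText) {
        this.document = raw.getBytes(StandardCharsets.UTF_8);
        this.expectedText = expectedText;
    }
}

final class LibraryComparison {
    // Runs every candidate against every sample and counts correct results.
    static Map<String, Integer> score(List<CandidateLibrary> candidates, List<Sample> samples) {
        Map<String, Integer> passed = new LinkedHashMap<>();
        for (CandidateLibrary candidate : candidates) {
            int ok = 0;
            for (Sample sample : samples) {
                try {
                    if (sample.expectedText.equals(candidate.extractText(sample.document))) {
                        ok++;
                    }
                } catch (Exception e) {
                    // A crash on a sample counts as a failed sample, not a harness error.
                }
            }
            passed.put(candidate.name(), ok);
        }
        return passed;
    }

    // Stand-in candidate (assumption): decodes UTF-8, optionally trims, optionally rejects empty input.
    static CandidateLibrary candidate(String name, boolean trims, boolean failsOnEmpty) {
        return new CandidateLibrary() {
            public String name() { return name; }
            public String extractText(byte[] document) {
                if (failsOnEmpty && document.length == 0) {
                    throw new IllegalArgumentException("empty document");
                }
                String text = new String(document, StandardCharsets.UTF_8);
                return trims ? text.trim() : text;
            }
        };
    }

    public static void main(String[] args) {
        List<Sample> samples = Arrays.asList(
                new Sample("hello", "hello"),
                new Sample("  padded  ", "padded"),
                new Sample("", ""));
        List<CandidateLibrary> candidates = Arrays.asList(
                candidate("lib-a", true, false),
                candidate("lib-b", false, false),
                candidate("lib-c", true, true));
        System.out.println(score(candidates, samples)); // {lib-a=3, lib-b=2, lib-c=2}
    }
}
```

In a real setup the stand-ins would be replaced by thin wrappers around the actual libraries, and the samples by a corpus of real files per format.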

-Bertrand

Re: Support for document libraries

Carsten Ziegeler
Bertrand Delacretaz wrote:

> On 7/10/07, Carsten Ziegeler <[hidden email]> wrote:
>
>> ... Although Tika is more the framework for plugging in such stuff, it
>> perhaps makes sense to try to start something like that as subprojects
>> of Tika?...
>
> I would agree, although IMHO Tika should reuse existing libraries as
> much as possible.
>
Yes, it doesn't make sense to reinvent the wheel if there are
good-enough libraries out there. But afaik for several formats there
aren't suitable libs available, so these are the cases where I think
that it makes sense to "drag them in".

> In some cases, the Tika part could just consist of automated tests for
> existing libraries, to help in selecting and validating them.
>
> -Bertrand
>


--
Carsten Ziegeler
[hidden email]



Re: Support for document libraries

robert burrell donkin-2
On 7/10/07, Carsten Ziegeler <[hidden email]> wrote:

> Bertrand Delacretaz wrote:
> > On 7/10/07, Carsten Ziegeler <[hidden email]> wrote:
> >
> >> ... Although Tika is more the framework for plugging in such stuff, it
> >> perhaps makes sense to try to start something like that as subprojects
> >> of Tika?...
> >
> > I would agree, although IMHO Tika should reuse existing libraries as
> > much as possible.
> >
> Yes, it doesn't make sense to reinvent the wheel if there are
> good-enough libraries out there. But afaik for several formats there
> aren't suitable libs available, so these are the cases where I think
> that it makes sense to "drag them in".

IMHO it makes sense to start them in Tika, but Commons might be a good
long-term home for at least some of them. If these really are
libraries, it would be best to isolate them from the start and
then add adapter code to Tika.

For example, there is talk of a couple of possible options for
MIME-type discovery. Perhaps it would make sense to factor both
options out as libraries and keep just the adapters in Tika.

- robert

Re: Support for document libraries

Carsten Ziegeler
robert burrell donkin wrote:
> IMHO it makes sense to start them in tika but possibly commons might
> be a good long term home for some at least. if these really are
> libraries then it would be best to isolate them from the start and
> then add adaption code to tika.
>
> for example, there is talk of a couple of possible options for
> MIME-type discovery. perhaps it would make sense to factor both
> options as libraries and just have the adapters in tika.
>
Yes, that definitely makes sense - these libs could be independent of
the "core", and the core must definitely be independent of the libs.
And layering this with adapters is a good idea.
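
The layering discussed here can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: `MagicNumbers`, `MimeTypeDetector`, and `MagicNumbersAdapter` are hypothetical names, not Tika's API. The standalone library knows nothing about the core, the core defines only an interface, and the adapter is the single piece that depends on both.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Standalone library (hypothetical): no dependency on the core.
final class MagicNumbers {
    // PNG files start with this fixed 8-byte signature.
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    // PDF files start with the ASCII header "%PDF-".
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns a short library-specific format name, or null if unknown.
    static String sniff(byte[] data) {
        if (startsWith(data, PNG)) return "png";
        if (startsWith(data, PDF)) return "pdf";
        return null;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}

// Core (hypothetical): an interface only, no dependency on any concrete library.
interface MimeTypeDetector {
    Optional<String> detect(byte[] data);
}

// Adapter (hypothetical): translates the library's vocabulary into the core's.
final class MagicNumbersAdapter implements MimeTypeDetector {
    @Override
    public Optional<String> detect(byte[] data) {
        String kind = MagicNumbers.sniff(data);
        if (kind == null) return Optional.empty();
        switch (kind) {
            case "png": return Optional.of("image/png");
            case "pdf": return Optional.of("application/pdf");
            default:    return Optional.empty();
        }
    }
}
```

With this split, a second detection library could be plugged in by writing another adapter, without touching either the core interface or the first library.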

Carsten

--
Carsten Ziegeler
[hidden email]



Re: Support for document libraries

Jukka Zitting
In reply to this post by robert burrell donkin-2
Hi,

On 7/10/07, robert burrell donkin <[hidden email]> wrote:
> IMHO it makes sense to start them in tika but possibly commons might
> be a good long term home for some at least. if these really are
> libraries then it would be best to isolate them from the start and
> then add adaption code to tika.

Another potential home would be POI if they are interested in widening
their scope beyond Microsoft formats.

> for example, there is talk of a couple of possible options for
> MIME-type discovery. perhaps it would make sense to factor both
> options as libraries and just have the adapters in tika.

+1

BR,

Jukka Zitting

Re: Support for document libraries

Jeremias Maerki-2
In reply to this post by Carsten Ziegeler
Adding document format libraries as subprojects of Tika still "hides"
them somewhat, so this wouldn't really solve the problem of easily
finding such libraries. If new libraries are to be developed, I would
think that a lab or Commons is better suited.

There has been much talk over the years about creating an image library
inside the ASF, but it has never developed into a real effort. It's a lot
of work, and with ImageIO built into the JDK only exotic wishes are still
open.
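
As a quick way to check what ImageIO already covers on a given JDK, a small sketch like the following lists the image formats for which a reader is registered. The class name `ImageIoFormats` is made up for illustration. The exact list depends on the JDK version and on any installed plugins, so no fixed output is shown.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import javax.imageio.ImageIO;

final class ImageIoFormats {
    // Collects the reader format names known to ImageIO, lower-cased and de-duplicated.
    static Set<String> readableFormats() {
        Set<String> names = new TreeSet<>();
        for (String name : ImageIO.getReaderFormatNames()) {
            names.add(name.toLowerCase(Locale.ROOT));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(readableFormats());
    }
}
```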

If we had a Tika Wiki we could at least list potential existing libraries
and libraries that we'd like but don't exist. We could list licenses,
candidates for incubation, quality/maturity indicators...

Inside the XML Graphics project, we have the following available (if
anyone is interested to know):
* XMP metadata framework in XML Graphics Commons, read/write, work in
progress
* PostScript DSC in XML Graphics Commons, read/write (no PS interpreter!)
* PNG and TIFF codecs in XML Graphics Commons, read/write
* PDF in FOP, write only
* RTF in FOP, write only
* SVG in Batik, read/write

Others:
* PDF (PDFBox @SourceForge), read/write, signalled interest for incubation

Personal wishlist:
* ODF, read/write
* Mars, read/write

On 10.07.2007 09:18:33 Carsten Ziegeler wrote:

> AFAIK there is currently no central place at Apache where
> libraries/frameworks for handling specific document formats are
> developed. We have individual projects like POI, of course.
>
> If you are searching for Java libraries that support a specific format,
> like some image formats, you'll find many libraries of varying quality,
> and it's really hard (if not impossible) to choose the right one.
>
> I'm wondering if something could be done about it by starting a project
> at Apache which supports various file formats (like images, mp3 etc.) -
> perhaps by incubating some existing stuff.
>
> Although Tika is more the framework for plugging in such stuff, it perhaps
> makes sense to try to start something like that as subprojects of Tika?
>
> WDYT?
>
> Carsten
> --
> Carsten Ziegeler
> [hidden email]
>


Jeremias Maerki