Working with unreleased POI code

Jukka Zitting
Hi,

I looked at enhancing the structured parsing abilities of the MS
Office parsers, but except for Excel I don't think it makes sense to
add much new stuff there until the relevant POI libraries are more
feature-rich. I've just contacted the POI team about getting some of
their scratchpad code released so we could leverage it in Tika.

I don't think it's a good idea to introduce unreleased dependencies to
Tika, but how about if I started a sandbox area in SVN for Parser
components based on unreleased or otherwise experimental code? That
would help us work with external projects and give them better
feedback before they make new releases.

BR,

Jukka Zitting

Re: Working with unreleased POI code

Sami Siren-2
2008/2/17, Jukka Zitting <[hidden email]>:

> I don't think it's a good idea to introduce unreleased dependencies to
> Tika,

+1

> but how about if I started a sandbox area in SVN for Parser
> components based on unreleased or otherwise experimental code?

+1

--
 Sami Siren

Re: Working with unreleased POI code

Bertrand Delacretaz-2
In reply to this post by Jukka Zitting
On Feb 17, 2008 1:31 PM, Jukka Zitting <[hidden email]> wrote:

> ...I don't think it's a good idea to introduce unreleased dependencies to
> Tika, but how about if I started a sandbox area in SVN for Parser
> components based on unreleased or otherwise experimental code?..

+1

-Bertrand

Re: Working with unreleased POI code

Jukka Zitting
In reply to this post by Jukka Zitting
Hi,

On Feb 17, 2008 2:31 PM, Jukka Zitting <[hidden email]> wrote:
> I looked at enhancing the structured parsing abilities of the MS
> Office parsers, but except for Excel I don't think it makes sense to
> add much new stuff there until the relevant POI libraries are more
> feature-rich. I've just contacted the POI team about getting some of
> their scratchpad code released so we could leverage it in Tika.

It turned out that they're already releasing the scratchpad code as a
separate Maven artifact, so for now I've simply added that as another
normal dependency and replaced our custom Word and PowerPoint parsing
code with text extractors from POI. I'll be looking at adding more
fine-grained parsing based on existing POI features.

BR,

Jukka Zitting