Tika 2.0 - Replace POI IOUtils with commons-io IOUtils


Bob Paulin-2
Hi,

Currently the Apache POI dependency is in several modules and it's sort
of a beast (> 2 MB in size).   It appears many of the modules are only
using the IOUtils library.  The big exception is the office module which
is responsible for parsing documents. These methods appear to also exist
in commons-io, which is only ~180 KB. Any concerns with replacing this
POI stuff with commons-io?  Does POI offer anything above the commons-io
functionality in IOUtils? If not I think it would be great to isolate
the poi dependency to the office module only.

- Bob

Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

Bob Paulin-2
There is also

org.apache.poi.util.StringUtil (in cad module)
and
org.apache.poi.util.LittleEndian (in code module)

Neither of these seems to have a Commons library replacement, from what
I can see.  Given the small amount of code in the methods that are
actually used, would it make sense to move these into tika-core? Perhaps
suggest a patch to Commons to take on some of these? Other suggestions?

- Bob
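
For illustration, the kind of little-endian number reading in question can be covered with the JDK alone. The following is a minimal sketch, not POI's or Tika's actual code: the class name `LittleEndianSketch` and its method names are hypothetical, chosen only to show how small such a helper could be if it lived in tika-core.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper: little-endian reads from a byte array using only the JDK. */
public final class LittleEndianSketch {
    private LittleEndianSketch() {}

    /** Reads an unsigned 16-bit little-endian value starting at offset. */
    public static int getUShort(byte[] data, int offset) {
        // Mask to drop the sign extension of the short.
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getShort(offset) & 0xFFFF;
    }

    /** Reads a signed 32-bit little-endian value starting at offset. */
    public static int getInt(byte[] data, int offset) {
        return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt(offset);
    }

    /** Reads an unsigned 32-bit little-endian value, widened to long. */
    public static long getUInt(byte[] data, int offset) {
        return getInt(data, offset) & 0xFFFFFFFFL;
    }
}
```

With the bytes `0x34, 0x12`, `getUShort(data, 0)` yields `0x1234`. Whether the parsers need more than these few reads is something only a look at the actual call sites would settle.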



Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

Nick Burch-2
In reply to this post by Bob Paulin-2
On Sun, 27 Mar 2016, Bob Paulin wrote:
> Currently the Apache POI dependency is in several modules and it's sort
> of a beast (> 2 MB in size).

You should've seen it before Jukka and Yegor spent a crazy ApacheCon
hacking up the ooxml-lite support... ;-)

> It appears many of the modules are only using the IOUtils library.

I suspect a strong overlap with the parser classes I've helped write...

> Any concerns with replacing this POI stuff with commons-io? Does POI
> offer anything above the commons-io functionality in IOUtils? If not I
> think it would be great to isolate the poi dependency to the office
> module only.

A lot of the use is for endian-specific reading of numbers and strings.
Might be a bit of stream stuff, but mostly that can be passed off to the
Tika IO utils classes.

From a quick check, I can't see any endian number stuff in commons IO, but
I might have missed it, or it might be in a different commons module. If
not, there might be something to be said for popping that POI logic along
with some of the Ogg-Vorbis utils stuff (another one with my grubby mitts
all over it) into a more helpful general utils grouping.

Nick
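
As a rough illustration of the endian-specific string reading mentioned above, decoding little-endian UTF-16 text needs nothing beyond the JDK. This is a sketch under assumptions: `StringLeSketch` and `fromUnicodeLE` are hypothetical names for this example, not the POI or Tika API.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: decode little-endian UTF-16 text from a byte array. */
public final class StringLeSketch {
    private StringLeSketch() {}

    /**
     * Decodes charCount UTF-16 code units (two bytes each, low byte first)
     * starting at the given byte offset.
     */
    public static String fromUnicodeLE(byte[] data, int offset, int charCount) {
        return new String(data, offset, charCount * 2, StandardCharsets.UTF_16LE);
    }
}
```

For example, the four bytes `0x48, 0x00, 0x69, 0x00` decode to `"Hi"`.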

Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

Bob Paulin-2
Hi Nick,

On 3/27/2016 6:52 PM, Nick Burch wrote:
> On Sun, 27 Mar 2016, Bob Paulin wrote:
>> Currently the Apache POI dependency is in several modules and it's
>> sort of a beast (> 2 MB in size).
>
> You should've seen it before Jukka and Yegor spent a crazy ApacheCon
> hacking up the ooxml-lite support... ;-)
I can only imagine.

>
>> It appears many of the modules are only using the IOUtils library.
>
> I suspect a strong overlap with the parser classes I've helped write...
>
>> Any concerns with replacing this POI stuff with commons-io? Does POI
>> offer anything above the commons-io functionality in IOUtils? If not
>> I think it would be great to isolate the poi dependency to the office
>> module only.
>
> A lot of the use is for endian-specific reading of numbers and
> strings. Might be a bit of stream stuff, but mostly that can be passed
> off to the Tika IO utils classes.
Didn't even think of looking at Tika IO but yes that would be even better.
>
> From a quick check, I can't see any endian number stuff in commons
> IO, but I might have missed it, or it might be in a different commons
> module. If not, there might be something to be said for popping that
> POI logic along with some of the Ogg-Vorbis utils stuff (another one
> with my grubby mitts all over it) into a more helpful general utils
> grouping
Yes, I think overall if these functions can live somewhere either
inside Tika or in a smaller dependency, we're in a better place. I'll
take a look at Ogg-Vorbis.

Thanks!
>
> Nick
>


Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

Bob Paulin-2
In reply to this post by Nick Burch-2
Tika's IOUtils appears to be missing the readFully method.  Should that
be added?

- Bob
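
For reference, a readFully-style method is small. The sketch below is an assumption about what such a method could look like, not the POI or commons-io implementation: the class name `ReadFullySketch` is hypothetical, and the return convention (bytes actually read, 0 at immediate end of stream) is a choice made for this example.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: keep reading until the buffer is full or the stream ends. */
public final class ReadFullySketch {
    private ReadFullySketch() {}

    /**
     * Fills buffer from the stream, looping because a single read() call
     * may return fewer bytes than requested.
     *
     * @return the number of bytes actually read, which is less than
     *         buffer.length only if the stream ended first
     */
    public static int readFully(InputStream in, byte[] buffer) throws IOException {
        int total = 0;
        while (total < buffer.length) {
            int n = in.read(buffer, total, buffer.length - total);
            if (n < 0) {
                break; // end of stream
            }
            total += n;
        }
        return total;
    }
}
```

The loop is the whole point: callers that use a bare `InputStream.read(byte[])` and assume the buffer was filled can silently truncate data on slow or chunked streams.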



Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

Nick Burch-2
In reply to this post by Bob Paulin-2
On Sun, 27 Mar 2016, Bob Paulin wrote:
> Yes I think overall if these functions can live in somewhere either
> inside tika or a smaller dependent library we're in a better place. I'll
> take a look at Ogg-Vorbis.

The two util classes there, that spring to mind, are:
https://github.com/Gagravarr/VorbisJava/blob/master/core/src/main/java/org/gagravarr/ogg/IOUtils.java
https://github.com/Gagravarr/VorbisJava/blob/master/core/src/main/java/org/gagravarr/ogg/BitsReader.java

Nick

Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

Nick Burch-2
In reply to this post by Bob Paulin-2
On Sun, 27 Mar 2016, Bob Paulin wrote:
> Tika's IOUtils appears to be missing the readFully method.  Should that
> be added?

There was discussion about getting rid of the Tika IOUtils method in
favour of depending on commons-io. If that method is on commons-io, then
we could use that without needing to add to Tika IOUtils. If not, probably
best to ask the Commons community to add our method there, wait for a new
release and use that!

Nick

RE: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

kkrugler
Hi Bob,

> From: Nick Burch
> Sent: March 28, 2016 6:49:09am PDT
> To: [hidden email]
> Subject: Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils
>
> On Sun, 27 Mar 2016, Bob Paulin wrote:
>> Tika's IOUtils appears to be missing the readFully method.  Should that be added?
>
> There was discussion about getting rid of the Tika IOUtils method in favour of depending on commons-io. If that method is on commons-io, then we could use that without needing to add to Tika IOUtils. If not, probably best to ask the Commons community to add our method there, wait for a new release and use that!

See https://issues.apache.org/jira/browse/TIKA-1706 for the issue - and seems like 2.0 is a fine place to make the clean switch to just using Commons IOUtils.

-- Ken


Re: Tika 2.0 - Replace POI IOUtils with commons-io IOUtils

Bob Paulin-2
Ken,

Thank you for reminding me of this issue.  It seems we had agreed to use
commons-io in a later version.  Doing this in tika-core would make it a
transitive dependency of all the 2.0 parsers, which again would leave
just the string utils and LittleEndian code to port over to a library or
into tika-core.

- Bob
