[DISCUSS] Unnecessary deps exclusion in `tika-parsers`

Konstantin Gribov
Hi, folks.

It seems that we have too many dependencies in `tika-parsers`, and many of
them may not actually be used. As Tim found in TIKA-2007 [1],
`jackson-core` wasn't necessary for `tika-parsers` at all.

When I looked into the current parser deps, I found a lot of strange ones, like
`quartz` with `c3p0` (a JDBC connection pool implementation) and `ehcache-core` via
`cdm`, Lucene parts (via `ctakes-core`), Spring Framework 3.x (also via
`ctakes-core`), et cetera. The latter could even break an application that has
another Spring version among its transitive deps.

Also, at first glance there seem to be no tests for the cTAKES parser, so
I have no easy way to check what I can exclude from the deps without breaking
things.

What do you think about trimming some of these deps? With at least minimal
test coverage to ensure common use cases won't break, of course.

[1]:
https://issues.apache.org/jira/browse/TIKA-2007?focusedCommentId=15435206&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15435206
--

Best regards,
Konstantin Gribov
Re: [DISCUSS] Unnecessary deps exclusion in `tika-parsers`

kkrugler
I think excluding more deps would be good…but challenging.

The problem is that some of the jars only wind up getting used for edge cases (e.g. you have an encrypted email, so you need Bouncy Castle; something like that has bitten me in the past).

So it’s hard to know what’s really required or not. Is there a good Java tool for tracing all possible calls from starting points, to see if it’s even possible to reach a jar?

Though that would need some help for cases where we’re dynamically loading classes (mostly plug-in support?)
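
To illustrate why dynamic loading is a problem for static tracing, here is a minimal sketch. The class and method names are hypothetical and this is not Tika code; it only contrasts a class name that is visible as a literal with one that is assembled at runtime.

```java
final class DynamicLoadingDemo {

    // Literal class name: a static tool can read the constant and follow it.
    static Class<?> loadByConstant() throws ClassNotFoundException {
        return Class.forName("java.util.ArrayList");
    }

    // Name assembled at runtime (for example from a config or services file):
    // the target class does not appear as a constant in the bytecode.
    static Class<?> loadByComputedName(String packageName, String simpleName)
            throws ClassNotFoundException {
        return Class.forName(packageName + "." + simpleName);
    }

    public static void main(String[] args) throws ClassNotFoundException {
        System.out.println(loadByConstant().getName());
        System.out.println(loadByComputedName("java.util", "ArrayList").getName());
    }
}
```

Both calls load the same class, but only the first is likely to be found by a call-graph tool that looks at constants, so any jar reached only through the second style would look unused.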

— Ken


--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr



Re: [DISCUSS] Unnecessary deps exclusion in `tika-parsers`

Konstantin Gribov
As far as I know, ProGuard does such tracing internally, but it works only for
trivial cases (like `Class.forName` with a string constant, see [1]).
Another simple way is to monitor which classes were loaded, using
`-verbose:class` on HotSpot [2].

But the second way wouldn't show classes that were never loaded for lack
of tests, as with the cTAKES parser.
At least that method catches SPI and similar dynamic loading of
plugins/modules.
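
A rough sketch of how the `-verbose:class` output could be summarized per jar follows. It assumes the Java 8 HotSpot line shape `[Loaded <class> from <source>]`; newer JVMs print class loading through unified logging in a different format, so the pattern would need adjusting there. The class name `LoadedJars` and the log file argument are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LoadedJars {

    // Assumed Java 8 HotSpot format: [Loaded com.example.Foo from file:/path/to/foo.jar]
    private static final Pattern LOADED =
            Pattern.compile("^\\[Loaded (\\S+) from (.+)\\]$");

    // Counts loaded classes per source (jar path or other location).
    // Lines that do not match the assumed format are ignored.
    static Map<String, Integer> countBySource(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = LOADED.matcher(line.trim());
            if (m.matches()) {
                counts.merge(m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: a file holding the captured JVM output of a -verbose:class run
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        countBySource(lines).forEach((source, n) -> System.out.println(n + "\t" + source));
    }
}
```

Any jar on the classpath that never shows up in the summary is only a candidate for exclusion: as noted above, a jar can be missing simply because no test exercised the parser that needs it.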

Also, we have optional deps like Stanford CoreNLP (optional because of its
license, AFAIK), which wouldn't be covered by either method.

Fine-grained exclusion would be hard, but I advocate for a coarse-grained
one.
It could give a noticeable result with moderate effort, IMHO.

To be honest, I just exclude edu.ucar and similar deps because of their
huge footprint when I use Tika, since I can trade support for some
scientific formats for a smaller footprint in my cases, so this issue doesn't
affect me directly.

[1]: http://proguard.sourceforge.net/index.html#manual/usage.html
[2]: http://www.oracle.com/technetwork/java/javase/clopts-139448.html#gbmtm


Best regards,
Konstantin Gribov
Re: [DISCUSS] Unnecessary deps exclusion in `tika-parsers`

kkrugler

> On Aug 24, 2016, at 11:37am, Konstantin Gribov <[hidden email]> wrote:
> Fine-grained exclusion would be hard, but I advocate for a coarse-grained
> one.
> It could give a noticeable result with moderate effort, IMHO.

I think with the Tika 2.0 work this is one of the goals, in that the parsers are broken into groups (with separate dependency sets).

So the “standard” set of parsers might be all you want/need, and then you won’t be pulling in a bunch of jars for other formats that you don’t care about.

— Ken

PS - http://stackoverflow.com/questions/4951517/static-analysis-of-java-call-graph has some useful refs.
