Splitting Tika to separate modules

Jukka Zitting
Hi,

Revisiting a topic that we've already considered (in at least
[1], [2] and [3])...

I'm working on integrating Tika into Jackrabbit [4], and there we found
it desirable [5] to make it easier to depend on just the core Tika
classes without all the parser dependencies.

To make this happen, I'd split Tika into the following component libraries:

* tika-core - core parts of Tika; everything but cli, gui, and the
parser.* packages
* tika-parsers - format-specific parser classes; with dependencies on
external libraries
* tika-app - depends on all of the above; adds cli and gui; standalone
jar packaging

We could (should?) further split the tika-parsers component into
smaller pieces based on the external dependencies used to allow
finer-grained control over what parser libraries get included in a
specific downstream package or deployment.

WDYT? If there are no objections, I'd like to target this for the Tika
0.4 release.

[1] http://markmail.org/message/n64zb3cawlm4ng3k
[2] http://markmail.org/message/ji3xabugnt6wlwdh
[3] http://markmail.org/message/2sd6d5ajhpqhcwcf
[4] https://issues.apache.org/jira/browse/JCR-1878
[5] http://markmail.org/message/cf6bj7qv7fyyxezu

BR,

Jukka Zitting

Re: Splitting Tika to separate modules

AJ Chen
Splitting into three major components will certainly help reusability, but
too many small components may make Tika less convenient to use because of
the large number of jars.

A different question: does Tika plan to provide a function for scraping web
pages? The Tika HTML parser returns everything on an HTML page. Some
applications, such as search, need to exclude sections such as
advertising, menus, footers, etc. It would be extremely useful to have a
scraping capability in Tika. Has anybody developed web page scraping code on
top of Tika?

thanks,
aj

--
AJ Chen, PhD
Co-Chair, Semantic Web SIG, sdforum.org
Technical Architect, healthline.com
http://web2express.org
Palo Alto, CA

Re: Splitting Tika to separate modules

Jonathan Koren

On Apr 8, 2009, at 11:10 AM, AJ Chen wrote:

> A different question: does tika plan to provide function for scraping web
> page? tika html parser provides everything on html page. for some
> applications such as search, it's required to exclude sections including
> advertising, menu, footer, etc.  it would be extremely useful to have
> scraping capability in tika. Has anybody developed web page scraping code on
> top of tika?


Well, a web page is already parsable HTML, so I don't know exactly why
Tika would be the relevant thing to use here. Excluding certain
sections of a page is an application-specific task. To turn your
example on its head, perhaps you want to read only the advertisements
for some sort of business/marketing reason.
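
To illustrate what such an application-specific filter could look like: Tika parsers report document content as XHTML SAX events, so one option is a SAX handler that drops text inside unwanted elements. This is only a sketch under assumptions. The class name SectionSkippingHandler, the extract helper, and the choice to match on "id" attributes are made up for illustration and are not Tika APIs, and the demo feeds the handler from the JDK's SAX parser rather than from Tika.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects text, skipping elements whose "id" is in a given set. */
class SectionSkippingHandler extends DefaultHandler {
    private final Set<String> skippedIds;
    private final StringBuilder text = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside a skipped element

    SectionSkippingHandler(Set<String> skippedIds) {
        this.skippedIds = skippedIds;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        String id = atts.getValue("id");
        if (skipDepth > 0) {
            skipDepth++; // nested element inside a skipped section
        } else if (id != null && skippedIds.contains(id)) {
            skipDepth = 1; // entering a skipped section
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (skipDepth > 0) {
            skipDepth--; // leaving a skipped element (or one nested in it)
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (skipDepth == 0) {
            text.append(ch, start, length);
        }
    }

    String text() {
        return text.toString();
    }

    /** Demo helper: parses a well-formed XHTML string with the JDK SAX parser. */
    static String extract(String xhtml, Set<String> skippedIds) {
        SectionSkippingHandler handler = new SectionSkippingHandler(skippedIds);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.text();
    }
}
```

Which ids (or classes, or tag names) count as "menu" or "advertising" differs per site, which is why the set is passed in by the caller instead of being built into the handler.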

--
Jonathan Koren
[hidden email]
http://www.soe.ucsc.edu/~jonathan/



Re: Splitting Tika to separate modules

Michael Wechner
Jukka Zitting wrote:

> Hi,
>
> Revisiting a topic that we've considered already before (in at least
> [1], [2] and [3])...
>
> I'm working on integrating Tika to Jackrabbit [4], and there we found
> it desirable [5] to make it easier to depend on just the core Tika
> classes without all the parser dependencies.
>
> To make this happen, I'd split Tika into following component libraries:
>
> * tika-core - core parts of Tika; everything but cli, gui, and the
> parser.* packages
> * tika-parsers - format-specific parser classes; with dependencies to
> external libraries
> * tika-app - depends on all of the above; adds cli and gui; standalone
> jar packaging
>
> We could (should?) further split the tika-parsers component into
> smaller pieces based on the external dependencies used to allow
> finer-grained control over what parser libraries get included in a
> specific downstream package or deployment.
>
> WDYT? If there are no objections, I'd like to target this for the Tika
> 0.4 release.
>  

+1. What exactly does this mean for backwards compatibility, i.e. for
current projects using, for example, 0.1-incubating? Does it just mean
reconfiguring library dependencies, or does it mean more, and if so,
what more?

Thanks

Michael

Re: Splitting Tika to separate modules

Jukka Zitting
Hi,

On Wed, Apr 8, 2009 at 11:39 PM, Michael Wechner
<[hidden email]> wrote:
> +1, whereas what does it mean exactly re backwards compatibility or rather
> current projects using for example 0.1-incubating?

We should be able to do the split simply by packaging the existing
classes differently, i.e. no changes would be needed in downstream
code. Simply reconfiguring your dependencies should be enough.

On the other hand, we may want to think about making the Java package
naming better match the component boundaries. Perhaps we should keep
the current package names for the 0.x releases, and reconsider the
structure before doing 1.0, when we have more experience with how the
Tika codebase should best be split into components.

BR,

Jukka Zitting

Re: Splitting Tika to separate modules

Rene Wiermer
On Wednesday, 08.04.2009 at 15:58 +0200, Jukka Zitting wrote:

> Hi,
>
> Revisiting a topic that we've considered already before (in at least
> [1], [2] and [3])...
>
> I'm working on integrating Tika to Jackrabbit [4], and there we found
> it desirable [5] to make it easier to depend on just the core Tika
> classes without all the parser dependencies.
>
> To make this happen, I'd split Tika into following component libraries:
>
> * tika-core - core parts of Tika; everything but cli, gui, and the
> parser.* packages
> * tika-parsers - format-specific parser classes; with dependencies to
> external libraries
> * tika-app - depends on all of the above; adds cli and gui; standalone
> jar packaging
>
> We could (should?) further split the tika-parsers component into
> smaller pieces based on the external dependencies used to allow
> finer-grained control over what parser libraries get included in a
> specific downstream package or deployment.
>
> WDYT? If there are no objections, I'd like to target this for the Tika
> 0.4 release.
>
> [1] http://markmail.org/message/n64zb3cawlm4ng3k
> [2] http://markmail.org/message/ji3xabugnt6wlwdh
> [3] http://markmail.org/message/2sd6d5ajhpqhcwcf
> [4] https://issues.apache.org/jira/browse/JCR-1878
> [5] http://markmail.org/message/cf6bj7qv7fyyxezu
>
> BR,
>
> Jukka Zitting
>

+1

In my use case, it would be ideal to add custom parsers to the auto
detection "on the fly".

I have a few ideas for how to implement that:
a) Make TikaConfig more flexible by adding setters for the parsers,

e.g.
TikaConfig conf = TikaConfig.getDefaultConfig();
// CHANGED: setParser is the proposed new setter, not an existing method
conf.setParser("application/custom", new MyCustomParser());
AutoDetectParser parser = new AutoDetectParser(conf);

This is trivial and almost non-intrusive, but leaves the work on the
client side.
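
To make option (a) concrete, here is a minimal self-contained sketch of a config object with such a setter. SketchParser and SketchConfig are stand-ins invented for this example; they only mimic the idea of a media-type-to-parser map and are not the actual Tika classes or signatures.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for a parser; reduced to a name for illustration (not Tika's Parser). */
interface SketchParser {
    String name();
}

/** Hypothetical mutable config mapping media types to parsers, as proposed in (a). */
class SketchConfig {
    private final Map<String, SketchParser> parsers = new HashMap<>();

    /** Registers or replaces the parser for a media type. */
    void setParser(String mediaType, SketchParser parser) {
        parsers.put(mediaType, parser);
    }

    /** Returns the parser for the media type, or the fallback if none is registered. */
    SketchParser getParser(String mediaType, SketchParser fallback) {
        return parsers.getOrDefault(mediaType, fallback);
    }
}
```

With this shape, the client registers its custom parser once and every later lookup for that media type picks it up, which is the "work on the client side" mentioned above.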

b) Extend the Parser interface to let the parsers themselves report
their capabilities (something like MyParser.getSupportedTypes()) and add
some class loading magic, e.g. specify a plugin directory in the config
file and load every class in there.
My gut tells me this could be a hack if not done right, but I like the
administrative view of this.
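
A rough sketch of how option (b) could fit together, under assumptions: the interface SelfDescribingParser and the class ParserRegistry are hypothetical names for this example, and the discovery step (plugin directory, class loading) is left out, so the registry simply takes whatever parsers were found.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical parser contract: each parser reports the media types it handles. */
interface SelfDescribingParser {
    Set<String> getSupportedTypes();
}

/** Builds a media type lookup from whatever parsers were discovered. */
class ParserRegistry {
    private final Map<String, SelfDescribingParser> byType = new HashMap<>();

    ParserRegistry(Iterable<? extends SelfDescribingParser> discovered) {
        for (SelfDescribingParser parser : discovered) {
            for (String type : parser.getSupportedTypes()) {
                byType.put(type, parser); // later parsers override earlier ones
            }
        }
    }

    /** Returns the parser claiming the media type, or null if none does. */
    SelfDescribingParser lookup(String mediaType) {
        return byType.get(mediaType);
    }
}
```

Because the constructor accepts any Iterable, the discovery mechanism stays open. One standard candidate would be java.util.ServiceLoader, which is itself an Iterable over the providers it finds, but that choice is an assumption of this sketch and not something settled in this thread.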

c) let a professional do it, like OSGi (Apache Felix)
Allows elegant runtime changes. It adds, however, another (small)
dependency and needs changes in the structure.


What do you think? I would implement one of these (or something similar) if
there is interest and no conflict with other plans.

René Wiermer


Re: Splitting Tika to separate modules

Michael Wechner
Jukka Zitting wrote:

> Hi,
>
> On Wed, Apr 8, 2009 at 11:39 PM, Michael Wechner
> <[hidden email]> wrote:
>  
>> +1, whereas what does it mean exactly re backwards compatibility or rather
>> current projects using for example 0.1-incubating?
>>    
>
> We should be able to do the split simply by packaging the existing
> classes differently, i.e. no changes would be needed in downstream
> code. Simply reconfiguring your dependencies should be enough.
>  

ok

> On the other hand we may want to think about making the Java package
> naming better match the component boundaries. Perhaps we should keep
> the current package names for the 0.x releases, and reconsider the
> structure before doing 1.0 when we have more experience on how the
> Tika codebase should best be split to components.
>  

sounds good to me

Thanks

Michael