Multilingual Tika

Multilingual Tika

Jukka Zitting
Hi,

With Tika 1.0 almost done (how cool is that!), I think it's time to
start looking forward to what we'll be doing during the 1.x cycle. One
thing I've had in mind for a long time is to make Tika more easily
usable in programming languages other than Java.

The tika-app jar already helps with that and I know there are people
using Tika in .NET with IKVM, but it would be nice to see tighter
Tika integration with languages like Python, Ruby, JavaScript, Perl
and PHP. Could we for example make a Ruby Gem out of Tika?

The Tika facade class provides a pretty nice set of basic
functionality that should be reasonably easy to port to other
languages. More advanced Tika constructs like the SAX event mechanism
or things like the ParseContext are probably trickier to port, so as a
first step I'd be interested in looking at simply providing a basic
set of Tika.py, Tika.rb, Tika.js, Tika.pm and Tika.php bindings (plus
whatever else people may be interested in) that just reflect the key
functionality found in Tika.java.
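
As a rough illustration of what a first-step Tika.py might look like, here
is a minimal sketch that simply shells out to the tika-app jar. It assumes
a `java` executable on the PATH and a local `tika-app.jar`; the
`build_command` helper and the Python `Tika` class are hypothetical, with
method names borrowed from `parseToString` and `detect` in Tika.java. The
flags are the ones tika-app lists in its `--help` output, so check them
against your version.

```python
import subprocess
from pathlib import Path

# Output modes mapped to tika-app command-line flags.
_FLAGS = {
    "text": "--text",
    "metadata": "--metadata",
    "detect": "--detect",
    "language": "--language",
}


def build_command(jar, mode, document, java="java"):
    """Return the argv list for one tika-app invocation (hypothetical helper)."""
    if mode not in _FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    return [java, "-jar", str(jar), _FLAGS[mode], str(document)]


class Tika:
    """Hypothetical Tika.py facade that delegates every call to tika-app."""

    def __init__(self, jar="tika-app.jar", java="java"):
        # Both defaults are assumptions: adjust to where the jar and JVM live.
        self.jar = Path(jar)
        self.java = java

    def _run(self, mode, document):
        cmd = build_command(self.jar, mode, document, self.java)
        # check=True raises CalledProcessError if the JVM exits non-zero.
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout

    def parse_to_string(self, document):
        """Extract plain text, mirroring Tika.parseToString() in Java."""
        return self._run("text", document)

    def detect(self, document):
        """Return the detected media type, mirroring Tika.detect() in Java."""
        return self._run("detect", document).strip()
```

Usage would be along the lines of `Tika().detect("report.pdf")`. Spawning a
JVM per call is slow, so this only shows the shape of the API, not a design
to settle on.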

Anyone interested in joining such an effort? Any pointers to existing
work along similar lines?

BR,

Jukka Zitting

Re: Multilingual Tika

Mattmann, Chris A (3010)
Hey Jukka,

I totally am. I've got some PHP skillz and Python skillz
that I would be willing to throw into the mix here.

One other thing along these lines I've had in mind for a while:
how cool would it be to have a CentOS RPM, or Debian pkg
or something like this and try and get tika into the std Linux
distributions? Like you install Linux and then you have the
tika command (maybe a wrapper around tika-app) at your
disposal? That would be awesome.

Anyhoo I'll be here to lend a hand when we're ready to get
started!

Cheers,
Chris

On Nov 4, 2011, at 5:22 PM, Jukka Zitting wrote:

> [...]


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [hidden email]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Re: Multilingual Tika

Jérôme Charron
>
> I totally am. I've got some PHP skillz and Python skillz
> that I would be willing to throw into the mix here.
>
Yes, I have some basic skillz in Python and some advanced skillz in PHP,
so I can help you!


> One other thing along these lines I've had in mind for a while:
> how cool would it be to have a CentOS RPM, or Debian pkg
> or something like this and try and get tika into the std Linux
> distributions? Like you install Linux and then you have the
> tika command (maybe a wrapper around tika-app) at your
> disposal? That would be awesome.
>
+1

Jérôme

--------
@jcharron <http://www.twitter.com/jcharron>
http://motre.ch/
http://jcharron.posterous.com/
http://www.shopreflex.fr/
http://www.staragora.com/


Re: Multilingual Tika

Michael McCandless
I would love to see better integration w/ dynamic languages!

I can help on the Python side.  Can we simply wrap Tika's APIs using
jcc to expose them in Python?  Ooh, it's already been done:
http://redmine.djity.net/projects/pythontika/wiki

Mike McCandless

http://blog.mikemccandless.com

2011/11/5 Jérôme Charron <[hidden email]>:

> [...]

Re: Multilingual Tika

Ingo Renner
In reply to this post by Jukka Zitting

On 05.11.2011 at 01:22, Jukka Zitting wrote:

Hi Jukka,

> The tika-app jar already helps with that and I know there are people
> using Tika in .NET with IKVM, but it would be nice to see tighter
> Tika integration with languages like Python, Ruby, JavaScript, Perl
> and PHP. Could we for example make a Ruby Gem out of Tika?

Some more PHP know-how here...


best
Ingo

--
Ingo Renner
TYPO3 Core Developer, Release Manager TYPO3 4.2, Admin Google Summer of Code

TYPO3
Open Source Enterprise Content Management System
http://typo3.org

Re: Multilingual Tika

Mattmann, Chris A (3010)
Hi Ingo,

Great meeting you at ApacheCon and we'd love to have your PHP
skillz on board! Contribute away :-)

The best start would probably be to file an issue along the lines of TIKA-773 [1],
and get a Tika wrapper for PHP going.

Cheers,
Chris

[1] https://issues.apache.org/jira/browse/TIKA-773

On Nov 14, 2011, at 4:08 AM, Ingo Renner wrote:

> [...]



Re: Multilingual Tika

Ingo Renner

On 14.11.2011 at 16:24, Mattmann, Chris A (388J) wrote:

Hi Chris,

> Great meeting you at ApacheCon and we'd love to have your PHP
> skillz on board! Contribute away :-)

Same here, it was a pleasure meeting you!

> The best start would probably be to file an issue along the lines of TIKA-773 [1],
> and get a Tika wrapper for PHP going.

Sorry for the delay, but here it is: https://issues.apache.org/jira/browse/TIKA-807


best
Ingo
