Integrating Tika with MITLL Text.jl library for language detection

Integrating Tika with MITLL Text.jl library for language detection

Trevor Claude Lewis
Hi all,

I am Trevor, a grad student at USC currently working with Prof.
Chris Mattmann and Paul Ramirez on integrating Tika with MIT Lincoln Lab’s
Text.jl library for language detection.
https://issues.apache.org/jira/browse/TIKA-1696

Since Text.jl is written in Julia, I have created a Julia HTTP server that
accepts text in the body of a PUT request and returns the detected language
as a JSON string.
https://github.com/trevorlewis/csci572dr.git
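
As a rough illustration of the client side of this setup, the sketch below sends text to such a server with a PUT request and pulls the language out of the JSON reply. It is in Java since that is where a Tika integration would live. The endpoint URL, the `language` field name in the reply, and the class name `RestLanguageClient` are all assumptions made for illustration; the actual server in the repository above may use a different path and response shape.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RestLanguageClient {
    // Assumed reply shape: {"language": "en"}. The real field name may differ.
    private static final Pattern LANGUAGE_FIELD =
            Pattern.compile("\"language\"\\s*:\\s*\"([^\"]*)\"");

    private final HttpClient http = HttpClient.newHttpClient();
    private final URI endpoint;

    public RestLanguageClient(URI endpoint) {
        this.endpoint = endpoint;
    }

    /** Sends the text as the body of a PUT request and returns the raw reply body. */
    public String putText(String text) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "text/plain; charset=UTF-8")
                .PUT(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
        HttpResponse<String> response =
                http.send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }

    /** Extracts the value of the assumed "language" field, if it is present. */
    public static Optional<String> extractLanguage(String json) {
        Matcher m = LANGUAGE_FIELD.matcher(json);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Convenience wrapper: PUT the text, then parse the reply. */
    public Optional<String> detect(String text) throws IOException, InterruptedException {
        return extractLanguage(putText(text));
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: RestLanguageClient <endpoint-url> <text>");
            return;
        }
        RestLanguageClient client = new RestLanguageClient(URI.create(args[0]));
        System.out.println(client.detect(args[1]).orElse("unknown"));
    }
}
```

The regular expression keeps the sketch free of third-party dependencies; a real integration would more likely use a proper JSON parser.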

I have also benchmarked the language identification results of the Julia HTTP
server against the Tika 1.11 language detector.
https://docs.google.com/spreadsheets/d/1cW6S2WpiN08pZ3UMVGMyQkO-fotUiUyGRemCrbC1miY/edit?usp=sharing

I was also looking at Ken Krugler's language detection work on Tika's 2.x
branch, and I was planning to fork that project and add the
Text.jl implementation.
https://issues.apache.org/jira/browse/TIKA-1723

I wanted to gather any input and feedback on this project.


Thanks,

Trevor Lewis
[hidden email]

RE: Integrating Tika with MITLL Text.jl library for language detection

kkrugler
Hi Trevor,

1. I assume the benchmark was using a pre-2.0 version of Tika, yes?

It would be great to try out the current support in the 2.0 branch, as a comparison with what we had previously.

Also, details on the test corpus used would be useful.

2. I started using the ServiceLoader pattern to support dynamic loading of language detectors.

There's a bit more work to move the common support classes (LanguageWriter, etc.) from the specific implementation sub-project into core.

Once that's done, you should be able to try directly adding your integration with Text.jl.
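
For readers unfamiliar with the ServiceLoader pattern mentioned above, here is a minimal sketch of how it discovers implementations at runtime. The interface `LanguageDetectorService` and the class `DetectorLoaderSketch` are hypothetical names for illustration, not Tika's actual API; only `java.util.ServiceLoader` itself is a real JDK class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class DetectorLoaderSketch {
    /** Hypothetical service interface; not Tika's real detector API. */
    public interface LanguageDetectorService {
        String name();
        String detect(String text);
    }

    /**
     * Collects every implementation the ServiceLoader can find on the classpath.
     * Providers are registered by listing their class names in a file named
     * META-INF/services/DetectorLoaderSketch$LanguageDetectorService
     * (the binary name of the interface) inside their jar.
     */
    public static List<LanguageDetectorService> loadDetectors() {
        List<LanguageDetectorService> found = new ArrayList<>();
        for (LanguageDetectorService detector
                : ServiceLoader.load(LanguageDetectorService.class)) {
            found.add(detector);
        }
        return found;
    }

    public static void main(String[] args) {
        List<LanguageDetectorService> detectors = loadDetectors();
        if (detectors.isEmpty()) {
            // Expected when no provider jar is on the classpath.
            System.out.println("No language detectors registered");
        }
        for (LanguageDetectorService detector : detectors) {
            System.out.println("Found detector: " + detector.name());
        }
    }
}
```

Under this pattern, a Text.jl-backed detector could ship as a separate jar with its own registration file, and the loading code would not need to change.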

-- Ken

> From: Trevor Claude Lewis
> Sent: February 23, 2016 10:55:46am PST
> To: [hidden email]
> Cc: Mattmann, Chris A (3980); Ramirez, Paul M (398M); [hidden email]
> Subject: Integrating Tika with MITLL Text.jl library for language detection

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Integrating Tika with MITLL Text.jl library for language detection

Mattmann, Chris A (3010)
Thanks Ken.

We are working on bringing in Text.jl and prefer at this point
to work on the 1.x branch, aka master. I’ve asked Trevor to look
at the 1.x branch and at pulling your code for the tika-detect
module from 2.x into 1.x, and then at adding Text.jl from MIT-LL as a
corresponding implementation there. It’s a REST-based server that
he set up in Julia that accepts PUT requests. We should be able
to start with Text.jl and later generalize to any REST service
that performs language identification.

You can see the issue from before here:

https://issues.apache.org/jira/browse/TIKA-1696


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [hidden email]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Ken Krugler <[hidden email]>
Date: Tuesday, February 23, 2016 at 11:14 AM
To: "[hidden email]" <[hidden email]>
Cc: jpluser <[hidden email]>, "Ramirez, Paul M (398M)"
<[hidden email]>
Subject: RE: Integrating Tika with MITLL Text.jl library for language
detection
