Library for extracting text content from binaries

Jukka Zitting
Hi,

I'm a committer of the Apache Jackrabbit project, and I've recently
been working on improving the full text indexing support in
Jackrabbit. We've used standard Lucene Java as the embedded full text
search engine in Jackrabbit, but created our own set of parsers for
extracting text content from binary files. So far our parser interface
TextFilter [1] has been Jackrabbit-specific, but my recent refactoring
proposal, TextExtractor, [2] aims for a generic solution that converts
a generic InputStream into a Reader for passing to Lucene Java.
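
As a rough illustration of the InputStream-to-Reader idea, here is a minimal sketch that uses only JRE classes. The method signature, the `PlainTextExtractor` class, and the `TextExtractorDemo` driver are assumptions made for illustration; the actual interface proposed in JCR-415 may well differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical interface shape; the real signature in JCR-415 may differ.
interface TextExtractor {
    Reader extractText(InputStream stream, String contentType, String encoding)
            throws IOException;
}

// Illustrative extractor that handles plain text only and needs nothing beyond the JRE.
final class PlainTextExtractor implements TextExtractor {
    @Override
    public Reader extractText(InputStream stream, String contentType, String encoding)
            throws IOException {
        // Unsupported content types yield an empty reader, so nothing gets indexed.
        if (contentType == null || !contentType.startsWith("text/plain")) {
            return new StringReader("");
        }
        // Fall back to UTF-8 when the caller does not know the encoding.
        Charset charset = (encoding != null)
                ? Charset.forName(encoding)
                : StandardCharsets.UTF_8;
        return new InputStreamReader(stream, charset);
    }
}

public class TextExtractorDemo {
    // Drains a Reader into a String; a search engine would consume the Reader directly.
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "Hello, Lucene".getBytes(StandardCharsets.UTF_8);
        TextExtractor extractor = new PlainTextExtractor();
        try (Reader reader = extractor.extractText(
                new ByteArrayInputStream(bytes), "text/plain", "UTF-8")) {
            System.out.println(readAll(reader));
        }
    }
}
```

In this sketch each binary format would get its own implementation of the interface, and the caller would hand the returned Reader to the indexer without needing to know which parser produced it.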

Before coming up with the proposal I tried looking for similar
solutions, but couldn't find any that would have satisfied my
requirement of no external dependencies other than the JRE. Your
o.a.nutch.parse.Parser interface however came quite close, and you
already have an extensive set of existing implementations, so I'd like
to leverage your work with the Parser implementations while finding a
way to avoid the full Nutch and Hadoop dependencies. I believe that
there are a number of other Lucene users who have similar needs.

Thus I'd like to ask if there would be interest in making your Parser
interface and implementations more easily accessible to external
projects, perhaps as a separate library. If you're interested, I'd be
happy to participate in such an effort.

[1] http://svn.apache.org/viewvc/jackrabbit/trunk/jackrabbit/src/main/java/org/apache/jackrabbit/core/query/TextFilter.java?view=markup
[2] http://issues.apache.org/jira/browse/JCR-415


BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - [hidden email]
Software craftsmanship, JCR consulting, and Java development

Re: Library for extracting text content from binaries

Jukka Zitting
Hi,

Any interest in this? If not, is there some other Lucene project that
I should approach?

BR,

Jukka Zitting


RE: Library for extracting text content from binaries

chrismattmann
Hi Jukka,

  Thanks for your email. Jerome Charron and I proposed a project with a
similar goal in mind that we wanted to dub "Tika". Tika would effectively be
a Lucene sub-project, and would factor out some of the capabilities you
mention below from Nutch, incl:

1. MimeType repository
2. Parser interface and Parser plugins
3. Metadata infrastructure
4. LanguageIdentifier

And a few others. Here is the mailing list thread discussion that we had a
few months back:

http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200604.mbox/%3cc82
[hidden email]%3e

Jerome and I have been quite busy lately, however, and we haven't had a
chance to draft the proposal to send to the Lucene PMC. Doug (and a few
others) told us that if we garner enough support and feel that the project
would make a significant contribution as its own Lucene sub-project, we
should email the PMC and see what happens. If you're interested in this
idea, maybe it would be a good idea to contact Jerome and me off-list,
and maybe we could get going on a proposal.

Thanks!

Cheers,
  Chris


______________________________________________
Chris A. Mattmann
[hidden email]
Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.




Re: Library for extracting text content from binaries

Michael Wechner
In reply to this post by Jukka Zitting
Jukka Zitting wrote:

> Hi,
>
> Any interest in this?


definitely :-)

Michi



--
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[hidden email]                        [hidden email]
+41 44 272 91 61


Re: Library for extracting text content from binaries

Jukka Zitting
In reply to this post by chrismattmann
Hi,

On 7/24/06, Chris Mattmann <[hidden email]> wrote:
> Thanks for your email. Jerome Charron and I proposed a project with a
> similar goal in mind that we wanted to dub "Tika". Tika would effectively be
> a Lucene sub-project, and would factor out some of the capabilities you
> mention below from Nutch, incl:

Sounds very useful! Jackrabbit could certainly use not only the
generalized parser functionality but also the other proposed features
like language identifiers, etc. Count me in.

> If you're interested in this idea, maybe it would be a good idea to contact Jerome
> and me off-list, and maybe we could get going on a proposal.

OK.

BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - [hidden email]
Software craftsmanship, JCR consulting, and Java development