Lius into apache incubator

Rida Benjelloun
Hi,
I would like to add the Lius framework (http://sourceforge.net/projects/lius/)
to the Apache Incubator. Are there any volunteers to do this job and to
contribute to the development of this project?

Thanks.

Rida Benjelloun.

Re: Lius into apache incubator

Otis Gospodnetic-2
Hi Rida,

Some comments in no particular order:

- Looks useful

- This looks like a more expanded version of what Erik and I wrote for LIA, and I know people often ask about and use that code, so I know there is a need for a framework that knows how to parse various document formats

- Nutch has some of its document parsing code written in the form of plugins.  A few people wanted to decouple that from Nutch in a Tika project: http://code.google.com/p/tika/ .  I'm not sure what the status is; I think only Jukka Zitting did any work there, and I think the initial idea was never fully finished.  If LIUS joins Lucene, I think some of this duplication should be cleaned up, so that we have only one framework for parsing various types of document formats.

- Going through the Incubator is one way to go.  Perhaps another way to get LIUS under Lucene is to just place it under contrib/, say contrib/lius.

- Licensing would have to change to ASL and you would probably also have to send in your ASF CLA.

- Any dependencies on GPL or LGPL code, or on code released under other incompatible licenses, would have to either be removed, or you'd have to fetch the required JARs at compile/build time.  A few projects under Lucene contrib/ already do that, I believe.

- Are there developers who are actively working on LIUS?  Fixing bugs, adding features, keeping up with new versions of dependencies, etc.

Otis
P.S.
Out of curiosity - this is a Laval University project, right?  But you work at DocuLibre?

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: Lius into apache incubator

mark harwood
I would prefer to see a good open-source framework that pulls together a
collection of document parsers but isn't tied directly to Lucene
(that binding would be provided by *another* project).
If the parser framework extracted document text in a standard,
document-and-application-neutral form (XML/Java object?), it could
underpin *any* IR/IE project wanting to make use of the parser
functionality, e.g. the GATE framework. That would ultimately
make a much more valuable piece of functionality, and it is the approach
taken by Stellent (used by many search engines, recently purchased by
Oracle).
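
As an illustration of the "Java object" option, a neutral result type might look something like the minimal sketch below. ParsedDocument, its fields, and the XML element names are hypothetical; they are not taken from Lius, Tika, or Stellent. The only point is that nothing in it refers to Lucene classes, so any consumer could use it.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical application-neutral result of parsing one document. */
final class ParsedDocument {
    private final String text;
    private final Map<String, String> metadata;

    ParsedDocument(String text, Map<String, String> metadata) {
        this.text = text;
        // Defensive copy so callers cannot change the result afterwards.
        this.metadata = new LinkedHashMap<>(metadata);
    }

    String getText() {
        return text;
    }

    Map<String, String> getMetadata() {
        return Collections.unmodifiableMap(metadata);
    }

    /** Renders the document as a small XML fragment for non-Java consumers. */
    String toXml() {
        StringBuilder sb = new StringBuilder("<document>");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            sb.append("<meta name=\"").append(escape(e.getKey())).append("\">")
              .append(escape(e.getValue())).append("</meta>");
        }
        sb.append("<text>").append(escape(text)).append("</text></document>");
        return sb.toString();
    }

    /** Escapes the characters that are special in XML text and attributes. */
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}

final class ParsedDocumentDemo {
    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("title", "Lius & Tika");
        ParsedDocument doc = new ParsedDocument("Parsers for everyone", meta);
        System.out.println(doc.toXml());
    }
}
```

A Lucene binding, a GATE binding, or anything else would then only need to read getText() and getMetadata(), or consume the XML form.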


Cheers
Mark




       
       
               

Re: Lius into apache incubator

Erik Hatcher
I'll echo what both Otis and Mark have said.

Lius does look useful, but there are many non-ASL'd dependencies (on  
a quick glance in your lib directory) that would be very difficult to  
resolve with the codebase here at the ASF.

        Erik



Re: Lius into apache incubator

Rida Benjelloun
In reply to this post by Otis Gospodnetic-2
Hi Otis,
Many thanks for your comments, and sorry for the late answer. I will add
Lius as a Lucene contrib and I will change the licence to ASL.
There are some developers contributing to Lius, but they are not very
active.
Regarding your question ("this is a Laval University project, right?  But you
work at DocuLibre?"):
I developed Lius during my studies at Laval University. I am still the
copyright owner of the project, so I can change the licence to ASL without any
problem. Lius has been used in several projects at Laval University, and I
decided to host it at Laval.
I work at Laval and at DocuLibre.

Tika is a really good project and I'm really interested in joining it.

Regards.



Re: Lius into apache incubator

mark harwood
In reply to this post by Rida Benjelloun
Hi Rida,
I've been talking with Jukka Zitting (involved in Nutch) about parsing/Tika and we started to sketch out some project objectives on the Wiki over there which may be of interest:
 http://code.google.com/p/tika/w/list

I recently did a round-up of the main open source projects which maintain their own custom document parsing framework and counted over 17. There was a fair mix of approaches and parser choices, but a lot of commonality, suggesting that a common project is possible/useful. The wiki sketches above were an attempt to outline the requirements for such a common project and also to ask where best to host it.

>>Tika is a really good project and I'm really interested in joining it.
I suspect one of the main differences between Lius and Tika's current objectives is that Tika aims to be independent of any application which consumes the parsed data (e.g. not tied to Lucene indexing classes). That said, I don't imagine it is too hard to decouple Lius's parser logic from its indexing logic.
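
To make the decoupling idea concrete, here is a rough sketch of how the parser side and the consumer side could be kept apart. All of the names below (TextExtractor, TextConsumer, ExtractionPipeline and so on) are hypothetical and are not the actual Lius or Tika APIs; the in-memory consumer merely stands in for something like a Lucene indexer.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser-side contract: knows document formats, nothing about indexing. */
interface TextExtractor {
    String extractText(byte[] rawDocument);
}

/** Hypothetical consumer-side contract: receives text, knows nothing about formats. */
interface TextConsumer {
    void accept(String documentId, String text);
}

/** Trivial extractor that treats the bytes as UTF-8 plain text. */
final class PlainTextExtractor implements TextExtractor {
    @Override
    public String extractText(byte[] rawDocument) {
        return new String(rawDocument, StandardCharsets.UTF_8);
    }
}

/** Stand-in for an indexer; a Lucene-backed consumer would live in a separate module. */
final class InMemoryConsumer implements TextConsumer {
    final List<String> entries = new ArrayList<>();

    @Override
    public void accept(String documentId, String text) {
        entries.add(documentId + ": " + text);
    }
}

final class ExtractionPipeline {
    /** Glue code: the only place where the parser and the consumer meet. */
    static void process(String id, byte[] raw, TextExtractor extractor, TextConsumer consumer) {
        consumer.accept(id, extractor.extractText(raw));
    }

    public static void main(String[] args) {
        InMemoryConsumer consumer = new InMemoryConsumer();
        byte[] raw = "hello parsers".getBytes(StandardCharsets.UTF_8);
        process("doc1", raw, new PlainTextExtractor(), consumer);
        System.out.println(consumer.entries);
    }
}
```

With this split, the extractor module has no compile-time dependency on any indexing library, which is the property Tika is aiming for.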


Cheers,
Mark




Re: Lius into apache incubator

Otis Gospodnetic-2
In reply to this post by Rida Benjelloun
Sounds like Lius & Tika would make a nice couple, Rida.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share


Re: Lius into apache incubator

Jukka Zitting
Hi,

I am interested in a Lius/Tika project that could be used not only with Lucene. As mentioned by Mark, there are a number of related efforts, which leads me to believe that an application-independent content analysis/parsing tool would be very helpful for many users.

I'd like to propose taking the project to the Apache Incubator to better attract interest from outside Lucene as well.

BR,

Jukka Zitting

Re: Lius into apache incubator

Rida Benjelloun
Hi,
You can already use Lius as a text extraction API: I have implemented for
each Indexer a method that lets you get the String content of the
document.
Lius could be used as a starting point for the Tika project, if the Tika
committers are interested in it. We can also, as Mark said, decouple Lius's
parser logic from its indexing logic.
Taking the project into the Apache Incubator could also be interesting, to get
more people involved in it.

My goal is to join our efforts to build a framework for text extraction.
Here is an example of text extraction with Lius:

LiusConfig lc =
    LiusConfigBuilder.getSingletonInstance().getLiusConfig(liusConfigPathString);

// Pick the Indexer that matches the document, then read its extracted text.
Indexer indexer = IndexerFactory.getIndexer(documentToIndex, lc);
String text = indexer.getContent();


On 3/1/07, Jukka Zitting <[hidden email]> wrote:

> View this message in context:
> http://www.nabble.com/Lius-into-apache-incubator-tf3145937.html#a9247508
> Sent from the Lucene - Java Developer mailing list archive at Nabble.com.

Re: Lius into apache incubator

Jukka Zitting
Hi,

On 3/1/07, Rida Benjelloun <[hidden email]> wrote:
> Lius could be used as a starting point of Tika project, if Tika committers
> are interested on it. We can also as mark said decouple Lius's parser logic
> from it's indexing logic.

I'm very interested in doing that. Another very useful codebase, among
others, would be the existing parser framework in the Nutch project.

> Taking the project into Apache incubator could be also interesting, to get
> more people involved on it.

Exactly. I'd like to avoid starting yet another codebase, and
focus more on bringing the best parts (both code and ideas) of the
existing projects together. The community-building focus of the
Incubator would likely help with that. Another aspect that would
benefit from Incubator scrutiny is the legal implications of
pulling together multiple document parser libraries under various
different licenses.

Would there be interest within the Lucene PMC in sponsoring a proposal
along such lines? I can volunteer to put together the proposal and act
as the champion and mentor of the project.

BR,

Jukka Zitting


Re: Lius into apache incubator

Grant Ingersoll-2
Is the Droids lab at all related to that parsing project in Nutch?
There seem to be several related efforts here that could
probably make for a nice new project under Lucene, IMO.  They all
seem to have to do with getting and preparing text for processing by
some type of consumer of text.

I sometimes wonder if the Analysis stuff in Lucene proper would  
benefit from moving out of core too, but I'm not sure what it would  
look like just yet and it is nice having it "optimized" for Lucene  
versus having to support other types of analysis phases.


Just my two cents,
Grant


--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ




Re: Lius into apache incubator

Rida Benjelloun
In reply to this post by Jukka Zitting
Hi,
On 3/1/07, Jukka Zitting <[hidden email]> wrote:
> I'm very interested in doing that. Another very useful codebase, among
> others, would be the existing parser framework in the Nutch project.

I agree.

> Would there be interest within the Lucene PMC in sponsoring a proposal
> along such lines? I can volunteer to put together the proposal and act
> as the champion and mentor of the project.

We can put together the proposal and you can be the mentor of the
project.


--
-----------------------------------------------------------
Rida Benjelloun, M.S.I., M.B.A.
President and CEO
DocuLibre inc.
Telephone: (418) 262-3222
Website: http://www.doculibre.com
Email: [hidden email]
-----------------------------------------------------------

Re: Lius into apache incubator

Jukka Zitting
In reply to this post by Grant Ingersoll-2
Hi,

On 3/1/07, Grant Ingersoll <[hidden email]> wrote:
> Is the Droids lab at all related to that parsing project in Nutch?

Partly, yes. I've been looking at Droids, and so far I think its main
focus has been on the crawling part rather than on the analysis of
retrieved content. A generic content analysis toolkit would likely be
a great companion for Droids. In fact, I was earlier contemplating
starting a related effort in Apache Labs (see
http://issues.apache.org/jira/browse/JCR-728), but there seems to be
enough demand for such functionality that a more full-fledged project
might be better.

> There seem to be several related efforts here that could
> probably make for a nice new project under Lucene, IMO.  They all
> seem to have to do with getting and preparing text for processing by
> some type of consumer of text.

Exactly. It would be great to see some consolidation of efforts.

BR,

Jukka Zitting


Re: Lius into apache incubator

Jukka Zitting
In reply to this post by Rida Benjelloun
Hi,

On 3/1/07, Rida Benjelloun <[hidden email]> wrote:
> On 3/1/07, Jukka Zitting <[hidden email]> wrote:
> > Would there be interest within the Lucene PMC in sponsoring a proposal
> > along such lines? I can volunteer to put together the proposal and act
> > as the champion and mentor of the project.
>
> -- >> We can put together the proposal and you can be the mentor of the
> project.

See below for a quick first draft (filled with TODOs).

PS. Will people mind if we use this list for fleshing out the details?
I've created a Google Group for Tika where we could also take the
discussion if that's preferred.

BR,

Jukka Zitting


Tika Proposal
=============

This is an early draft of a possible proposal for a Tika project
within the Apache Incubator. See
http://incubator.apache.org/guides/proposal.html for a description of
the proposal template.

Abstract
--------

Tika is a toolkit for detecting and extracting metadata and text
content from various documents using existing parser libraries.

Proposal
--------

The Tika content analysis toolkit will include features for detecting
the content types, character encodings, languages, and other
characteristics of existing documents and for extracting structured
text content from the documents.

The toolkit is targeted especially at search engines and other
content indexing and analysis tools, but will also be useful for other
applications that need to extract meaningful information from
documents that might be presented as nothing more than binary streams.

Instead of implementing its own document parsers, Tika will use
existing parser libraries like Jakarta POI and PDFBox.
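
As a rough illustration of the "detect, then delegate" idea in this section, the sketch below guesses a content type from leading magic bytes and hands the bytes to whichever parser adapter is registered for that type. ContentToolkit, DocumentParser, and the type label used for OLE2 files are hypothetical names chosen for this example, and the adapters here do not call POI or PDFBox; a real toolkit would wrap those libraries behind the same kind of interface.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical adapter around an existing parser library. */
interface DocumentParser {
    String parse(byte[] content);
}

final class ContentToolkit {
    private final Map<String, DocumentParser> parsersByType = new LinkedHashMap<>();

    void register(String contentType, DocumentParser parser) {
        parsersByType.put(contentType, parser);
    }

    /** Guesses a content type from leading "magic" bytes; a deliberately tiny subset. */
    static String detectType(byte[] content) {
        if (startsWith(content, new byte[] {'%', 'P', 'D', 'F'})) {
            return "application/pdf";
        }
        // OLE2 compound file header, used by legacy Microsoft Office documents.
        if (startsWith(content, new byte[] {(byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0})) {
            return "application/x-ole-storage"; // illustrative label only
        }
        return "text/plain"; // fallback when nothing else matches
    }

    /** Detects the type and delegates to the parser registered for it. */
    String extract(byte[] content) {
        String type = detectType(content);
        DocumentParser parser = parsersByType.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.parse(content);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}

final class ContentToolkitDemo {
    public static void main(String[] args) {
        ContentToolkit toolkit = new ContentToolkit();
        // Plain text needs no external library; other types would get real adapters.
        toolkit.register("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        System.out.println(toolkit.extract("just some text".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Keeping the registry keyed by content type means that a parser with a troublesome license could be shipped as a separate, optional adapter rather than as a hard dependency.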

Background
----------

The need for tools that automatically analyze and index content is
increasing as ever more information becomes available.

TODO: Discuss the various related projects and the lack of a common
analysis toolkit. Note how many of the existing tools have grown as
ad-hoc solutions to specific needs, and are often tightly bound to a
specific application or a parser library.

Rationale
---------

TODO

Initial Goals
-------------

TODO

Current Status
--------------

TODO

Meritocracy
-----------

TODO

Community
---------

TODO

Core Developers
---------------

TODO

Alignment
---------

TODO

Known Risks
-----------

TODO: There has been on-and-off interest in something like this for
quite a while already. How can we make sure that the current increase
in interest doesn't fade away?

Orphaned products
-----------------

TODO: See the comment above

Inexperience with Open Source
-----------------------------

TODO: Many of the interested participants have an open source background.

Homogenous Developers
---------------------

TODO: There is no central company behind the proposal.

Reliance on Salaried Developers
-------------------------------

TODO: Some of us are salaried for this, others are not.

Relationships with Other Apache Products
----------------------------------------

TODO: Lucene, Nutch, Jackrabbit, Droids, ...

An Excessive Fascination with the Apache Brand
----------------------------------------------

TODO

Documentation
-------------

TODO

Initial Source
--------------

TODO: Tika, Lius, Nutch?, ...

Source and Intellectual Property Submission Plan
------------------------------------------------

TODO

External Dependencies
---------------------

TODO: Some of the potential parser libraries will be GPL-licensed or
otherwise troublesome for an ASF project. How to best handle such
cases?

Cryptography
------------

TODO: Some of the document formats involve encryption and features
like DRM. While Tika itself will probably not include any
cryptographic code, the parser dependencies will most likely include
such code.

Required Resources
------------------

Mailing lists

  * [hidden email]

Subversion Directory

  * https://svn.apache.org/repos/asf/incubator/tika

Issue Tracking

  * JIRA TIKA

Other Resources

  * none

Initial Committers
------------------

TODO

Affiliations
------------

TODO

Sponsors
--------

Champion

TODO (I can volunteer)

Nominated Mentors

TODO (Three mentors is the recommendation, I can volunteer as one)

Sponsoring Entity

TODO (Apache Lucene?)


Re: Lius into apache incubator

Doug Cutting
Jukka Zitting wrote:
> PS. Will people mind if we use this list for fleshing out the details?
> I've created a Google Group for Tika where we could also take the
> discussion if that's preferred.

I think the Incubator Wiki would be the best place for this.

http://wiki.apache.org/incubator/?action=fullsearch&value=proposal&titlesearch=Titles

Interested folks could subscribe to the proposal page.  You could
announce the proposal page on several lists.  Will that work for you?

Also, I can probably help as a mentor if needed.

Doug


Re: Lius into apache incubator

Jukka Zitting
Hi,

On 3/1/07, Doug Cutting <[hidden email]> wrote:

> Jukka Zitting wrote:
> > PS. Will people mind if we use this list for fleshing out the details?
> > I've created a Google Group for Tika where we could also take the
> > discussion if that's preferred.
>
> I think the Incubator Wiki would be the best place for this.
>
> http://wiki.apache.org/incubator/?action=fullsearch&value=proposal&titlesearch=Titles
>
> Interested folks could subscribe to the proposal page.  You could
> announce the proposal page on several lists.  Will that work for you?

Sounds good. I uploaded the early draft to
http://wiki.apache.org/incubator/TikaProposal and will announce it in a
moment.

> Also, I can probably help as a mentor if needed.

Cool, thanks!

BR,

Jukka Zitting


Re: Lius into apache incubator

Rida Benjelloun
In reply to this post by Doug Cutting
Hi,
Thanks Doug, I think your help as a mentor will be very much appreciated.
Regards.

On 3/1/07, Doug Cutting <[hidden email]> wrote:

>
> Jukka Zitting wrote:
> > PS. Will people mind if we use this list for fleshing out the details?
> > I've created a Google Group for Tika where we could also take the
> > discussion if that's preferred.
>
> I think the Incubator Wiki would be the best place for this.
>
>
> http://wiki.apache.org/incubator/?action=fullsearch&value=proposal&titlesearch=Titles
>
> Interested folks could subscribe to the proposal page.  You could
> announce the proposal page on several lists.  Will that work for you?
>
> Also, I can probably help as a mentor if needed.
>
> Doug

Re: Lius into apache incubator

thorsten
In reply to this post by Rida Benjelloun
Renaud forwarded me the thread and I just subscribed,
so I apologize for not responding properly.

Thanks Renaud for the heads-up.

> Hi,
>
> On 3/1/07, Grant Ingersoll <[hidden email]> wrote:
> > Is the Droids lab at all related to that parsing project in Nutch?
>
> Partly, yes. I've been looking at Droids and so far I think its main
> focus has been on the crawling part rather than on the analysis of
> retrieved content.

Yes, Droids should be a generic crawler framework. I took Nutch,
ripped out the plugin/extension point framework and wrote some PoC plugins.
I changed many things to make the code simpler, so not much of the original
Nutch code is left. Further, I am using Ivy for dependency management for
the core and the plugins.

The first crawler is not close to the one from Nutch, but via plugins one
could implement the same functionality (but there is ATM no interest on
Nutch). The implemented crawler, x-m02y07, is more wget-style (and very
basic for now): request a URL, extract the links and save the page to disk.
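
As a rough illustration of that wget-style loop, here is a minimal Java
sketch. It is hypothetical and not taken from the Droids code base: the
class and method names are made up for this example, and a naive regex
stands in for a real HTML parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names exist in Droids.
class WgetStyleSketch {

    // Deliberately naive: matches href="..." or href='...' attributes.
    // A real crawler would use an HTML parser instead of a regex.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Step 1: request the URL and read the whole response body.
    static byte[] fetch(URI url) throws IOException {
        try (InputStream in = url.toURL().openStream()) {
            return in.readAllBytes();
        }
    }

    // Step 2: extract link targets and resolve them against the page URL.
    static List<URI> extractLinks(URI base, String html) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()));
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }

    // Steps 1-3 together: fetch, save the page to disk, return its links.
    // Assumes the page is UTF-8, which a real crawler would have to detect.
    static List<URI> crawlOnce(URI url, Path target) throws IOException {
        byte[] body = fetch(url);
        Files.write(target, body);
        return extractLinks(url, new String(body, StandardCharsets.UTF_8));
    }
}
```

The links returned by crawlOnce would be fed back into a queue of URLs
to visit; that queueing and any politeness rules are left out here.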

> A generic content analysis toolkit would likely be
> a great companion for Droids.

Yes indeed. I am ATM playing with
http://simile.mit.edu/repository/crowbar/trunk/ 

Stefano pointed me to it and it is very interesting, since the idea is
to use a Gecko-based browser as a server to browse a page and let the
browser analyze the page. That enables a crawler to index Web 2.0
components such as Ajax.

http://mail-archives.apache.org/mod_mbox/labs-labs/200702.mbox/browser

The core of any crawler is the link recognition, where we can go
different routes. In the short term we can enhance the parse-html Droids
plugin with NekoHTML (a similar route to the one Nutch is taking), but in
the long run we should try to incorporate a virtual browser like Stefano
pointed out on the labs ml.

>  In fact I was earlier contemplating
> about starting a related effort in Apache Labs (see
> http://issues.apache.org/jira/browse/JCR-728),

That seems aimed more at closing the MIME type gap that we have ATM, and I
think Labs would be the right place for this.

>  but there seems to be
> enough demand for such functionality that a more full-fledged project
> might be better.

Maybe you are interested in starting some plugins in Droids, and as soon
as we have some community around the code we can request incubation.

Some Forrest folks also expressed their interest in Droids.
Actually Forrest/Cocoon was one of the main reasons I started it.
The other was Solr.


>
> > There seems to be several efforts that are related here that could
> > probably make for a nice new project under Lucene, IMO.  They all
> > seem to have to do with getting and preparing text for processing by
> > some type of consumer of text.
>
> Exactly. It would be great to see some consolidation of efforts.
>

The great advantage of Labs is that all Apache committers have write
access, meaning cross-project efforts like this one are perfect to get
started in Labs. If enough people get attracted, the lab gets promoted.
When a lab is promoted, the files are moved over to the incubation area.
http://labs.apache.org/bylaws.html

Looking forward to seeing you on
[hidden email]

salu2
--
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java & XML                consulting, training and solutions

