Questions


Grant Ingersoll-2
Hey Gang,

I was wondering if you had a todo list or something somewhere?  I  
have been loosely following the discussions here and see the general  
outline of what the goals are here: http://www.mail-archive.com/tika- 
[hidden email]/msg00024.html (Tika discussions in Amsterdam)

Here's where I am at:  I am considering extracting the Nutch parsing  
plugins for a project I am undertaking and wrapping them for my own  
purposes, but knowing Tika is around, I would just as soon do this in  
the context of Tika, or at least try to help out that way and have it  
become a part of Tika.  I have not looked at Lius yet.  I guess I am  
wondering if you have some interfaces in mind that you want to hook  
into, or is the Nutch model (or Lius model) already going to serve as  
the main model?  I pretty much think the Nutch model has everything I  
need at the moment, but I don't want to carry around the whole set of  
Nutch dependencies.  I am not worried about content detection at this  
point so much as extraction.

Is the plan to adopt a similar plugin approach as Nutch?

So, I guess the question is what can I do at this point to help?  
Should I just go ahead with my needs and then give it back as a patch  
and you can decide what to do with it from there?  I  am in somewhat  
of a hurry to get the basics working in the next week or so.

Also, anyone have any recommendations for parsing various mail  
repositories like Outlook, Mac Mail (which I think is mbox), etc.?

Cheers,
Grant



Re: Questions

Grant Ingersoll-2
Also, please feel free to tell me I am getting too far ahead of
things...  :-)





Re: Questions

mark-268
In reply to this post by Grant Ingersoll-2
>>Also, anyone have any recommendations for parsing various mail
>>repositories like Outlook, Mac Mail (which I think is mbox), etc.?

"mstor" is a JavaMail implementation that should do a good job of handling
mbox parsing for you. I've used it, but it looks like the license isn't Apache
:( http://mstor.sourceforge.net/
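
For readers unfamiliar with mbox, here is a minimal sketch of what splitting such a file involves. It assumes the common convention that each message starts with a separator line beginning with "From " and that body lines starting with "From " were escaped as ">From " when the file was written. The class name MboxSplitter is hypothetical; this is not how mstor or JavaMail do it, and it does no header parsing or unescaping.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of mstor, JavaMail, Nutch or Tika. */
final class MboxSplitter {

    /** Returns the raw text of each message, without its "From " separator line. */
    static List<String> split(Reader source) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null; // stays null until the first separator is seen
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("From ")) {
                    // A separator line closes the previous message and opens a new one.
                    if (current != null) {
                        messages.add(current.toString());
                    }
                    current = new StringBuilder();
                } else if (current != null) {
                    current.append(line).append('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }
}
```

Each returned string still holds the raw headers and body, so a real extractor would hand it to a proper mail parser next.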

I'm not up to speed with the latest Tika developments, for which I must
apologise - I've been buried in other work since its inception.

Cheers,
Mark.




Re: Questions

Jukka Zitting
In reply to this post by Grant Ingersoll-2
Hi,

On 6/29/07, Grant Ingersoll <[hidden email]> wrote:
> I was wondering if you had a todo list or something somewhere?  I
> have been loosely following the discussions here and see the general
> outline of what the goals are here: http://www.mail-archive.com/tika-
> [hidden email]/msg00024.html (Tika discussions in Amsterdam)

That's probably the most complete todo list lookalike for now. There's
some gradual progress going on, but we are still in a formative phase
where not even some basic practices on svn use, etc. have emerged, so
I wouldn't put too much weight on any single message.

> Here's where I am at:  I am considering extracting the Nutch parsing
> plugins for a project I am undertaking and wrapping them for my own
> purposes, but knowing Tika is around, I would just as soon do this in
> the context of Tika, or at least try to help out that way and have it
> become a part of Tika.  I have not looked at Lius yet.  I guess I am
> wondering if you have some interfaces in mind that you want to hook
> into, or is the Nutch model (or Lius model) already going to serve as
> the main model?  I pretty much think the Nutch model has everything I
> need at the moment, but I don't want to carry around the whole set of
> Nutch dependencies.  I am not worried about content detection at this
> point so much as extraction.
>
> Is the plan to adopt a similar plugin approach as Nutch?

There seems to be a general consensus that the existing solutions like
Nutch are a good starting point but need some modifications before
they satisfy all the goals of Tika, but few specific decisions have
yet been made.

> So, I guess the question is what can I do at this point to help?
> Should I just go ahead with my needs and then give it back as a patch
> and you can decide what to do with it from there?  I  am in somewhat
> of a hurry to get the basics working in the next week or so.

I would recommend that you just go forward with your plan and don't
wait for us. :-) One thing you may want to take a look at is "Lius
Lite" in the Tika issue tracker, that contains a trimmed version of
the Lius framework, but if you already are familiar with Nutch then it
probably makes more sense to stick with that. I believe the eventual
Tika framework will end up incorporating concepts from both Nutch and
Lius (among others).

It would be certainly interesting to see what you end up with and
perhaps hear a brief summary of the main issues and concerns you
encountered. This is exactly the sort of stuff that Tika should
support, so your contributions would be very much welcome!

BR,

Jukka Zitting

Re: Questions

Grant Ingersoll-2

On Jun 29, 2007, at 6:36 PM, Jukka Zitting wrote:

>
> I would recommend that you just go forward with your plan and don't
> wait for us. :-) One thing you may want to take a look at is "Lius
> Lite" in the Tika issue tracker, that contains a trimmed version of
> the Lius framework, but if you already are familiar with Nutch then it
> probably makes more sense to stick with that. I believe the eventual
> Tika framework will end up incorporating concepts from both Nutch and
> Lius (among others).
>
> It would be certainly interesting to see what you end up with and
> perhaps hear a brief summary of the main issues and concerns you
> encountered. This is exactly the sort of stuff that Tika should
> support, so your contributions would be very much welcome!
>

Well, you will definitely get that chance at some point in time.

My main concern w/ extracting Nutch is all the dependencies on  
Hadoop, etc.  But it does seem like the shortest path for me.

-Grant

Re: Questions

Bertrand Delacretaz
On 6/30/07, Grant Ingersoll <[hidden email]> wrote:

> ...My main concern w/ extracting Nutch is all the dependencies on
> Hadoop, etc.  But it does seem like the shortest path for me....

I've mentioned Tika to a few colleagues lately, and one thing that
comes up often is that there are many document/format parsing
libraries around, which should ideally be usable as Tika plugins with
as few changes as possible.

But these libraries' dependencies are all over the place, and
probably conflict in many cases.

It might be good to take that into account in the design of Tika, and
use solid classloading and isolation mechanisms. OSGi comes to mind,
assuming it doesn't bloat the whole thing.
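
As a rough illustration of the isolation idea without OSGi, the sketch below gives each plugin its own class loader over its own jars, so one plugin's libraries are not visible to another. The class name PluginLoaders and the plugin names are hypothetical and not part of Tika, Nutch or Lius. It relies only on the standard java.net.URLClassLoader, which delegates to its parent first, so anything already on the parent class path is still shared.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: one class loader per plugin, keyed by plugin name. */
final class PluginLoaders {
    private final Map<String, URLClassLoader> loaders = new LinkedHashMap<>();
    private final ClassLoader parent;

    /** The parent loader should see only the shared core classes. */
    PluginLoaders(ClassLoader parent) {
        this.parent = parent;
    }

    /** Creates the loader for a plugin on first use; jars are that plugin's own class path. */
    synchronized ClassLoader loaderFor(String pluginName, URL... jars) {
        return loaders.computeIfAbsent(pluginName,
                name -> new URLClassLoader(jars.clone(), parent));
    }

    /** Resolves a class through the named plugin's loader (parent first, then its own jars). */
    synchronized Class<?> loadClass(String pluginName, String className)
            throws ClassNotFoundException {
        URLClassLoader loader = loaders.get(pluginName);
        if (loader == null) {
            throw new IllegalArgumentException("unknown plugin: " + pluginName);
        }
        return Class.forName(className, false, loader);
    }
}
```

This is much cruder than OSGi, which also handles versioned package imports and exports, but it shows the basic mechanism that keeps two plugins with conflicting library versions apart.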

-Bertrand

Re: Questions

Carsten Ziegeler
Bertrand Delacretaz wrote:

> On 6/30/07, Grant Ingersoll <[hidden email]> wrote:
>
>> ...My main concern w/ extracting Nutch is all the dependencies on
>> Hadoop, etc.  But it does seem like the shortest path for me....
>
> I've mentioned Tika to a few colleagues lately, and one thing that
> comes up often is that there are many document/format parsing
> libraries around, which should ideally be usable as Tika plugins with
> as few changes as possible.
>
> But these libraries' dependencies are all over the place, and
> probably conflict in many cases.
>
> It might be good to take that into account in the design of Tika, and
> use solid classloading and isolation mechanisms. OSGi comes to mind,
> assuming it doesn't bloat the whole thing.
>
Yes, in many cases a solid classloading mechanism is a must, and OSGi
definitely implements this properly.
I think we can leave this open (= no need to require OSGi) if we
have an open way of registering the plugins. Registering in an OSGi
environment might then be slightly different from registering in
a non-OSGi environment. Of course, using the latter might result in
classloading problems :) But then it's up to the developer to decide in
which environment Tika should run, with all the pros and cons that come
with this decision.
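
One way to picture such an open registration point is the sketch below: a plain registry that any environment can call into. The names TextParser and ParserRegistry are hypothetical and are not actual Tika, Nutch or Lius interfaces. Keying by MIME type is also an assumption made only for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical extraction contract, used only for this illustration. */
interface TextParser {
    String extractText(InputStream in) throws IOException;
}

/** Hypothetical container-neutral registry: it has no knowledge of OSGi, Spring or any other framework. */
final class ParserRegistry {
    private final Map<String, TextParser> parsers = new ConcurrentHashMap<>();

    /** Registers (or replaces) the parser responsible for a MIME type. */
    void register(String mimeType, TextParser parser) {
        parsers.put(mimeType, parser);
    }

    /** Removes the parser for a MIME type, for example when a plugin is unloaded. */
    void unregister(String mimeType) {
        parsers.remove(mimeType);
    }

    /** Returns the parser for a MIME type, or null if none is registered. */
    TextParser lookup(String mimeType) {
        return parsers.get(mimeType);
    }
}
```

In a plain Java program the application would call register directly at startup. In an OSGi environment the same calls could presumably be made when a plugin bundle starts and stops, so only the wiring differs between the two.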

Carsten


Re: Questions

Jukka Zitting
Hi,

On 7/2/07, Carsten Ziegeler <[hidden email]> wrote:

> Bertrand Delacretaz wrote:
> > I've mentioned Tika to a few colleagues lately, and one thing that
> > comes up often is that there are many document/format parsing
> > libraries around, which should ideally be usable as Tika plugins with
> > as few changes as possible.
> >
> > But these libraries' dependencies are all over the place, and
> > probably conflict in many cases.
> >
> > It might be good to take that into account in the design of Tika, and
> > use solid classloading and isolation mechanisms. OSGi comes to mind,
> > assuming it doesn't bloat the whole thing.
> >
> Yes, in many cases a solid classloading mechanism is a must, and OSGi
> definitely implements this properly.
> I think we can leave this open (= no need to require OSGi) if we
> have an open way of registering the plugins. Registering in an OSGi
> environment might then be slightly different from registering in
> a non-OSGi environment. Of course, using the latter might result in
> classloading problems :) But then it's up to the developer to decide in
> which environment Tika should run, with all the pros and cons that come
> with this decision.

+1 I think that the core Tika framework should be very lightweight and
easily composable in different environments. I even think that
we shouldn't mandate any "official" configuration or composition
mechanism. We may have some simple implementation as the default, but
it should be possible to use things like Spring or OSGi or whatever to
manage more complex scenarios.

BR,

Jukka Zitting

Re: Questions

chrismattmann
+1 here too. I would love to have a light-weight plugin loading mechanism,
and like the idea of not having to pick a single mechanism.

Cheers,
  Chris



On 7/2/07 4:38 AM, "Jukka Zitting" <[hidden email]> wrote:

> Hi,
>
> +1 I think that the core Tika framework should be very lightweight and
> easily composable in various different environments. I even think that
> we shouldn't mandate any "official" configuration or composition
> mechanism. We may have some simple implementation as the default, but
> it should be possible to use things like Spring or OSGi or whatever to
> manage more complex scenarios.
>
> BR,
>
> Jukka Zitting



Re: Questions

Rida Benjelloun
+1
Rida Benjelloun

On 7/8/07, Chris Mattmann <[hidden email]> wrote:

>
> +1 here too. I would love to have a light-weight plugin loading mechanism,
> and like the idea of not having to pick a single mechanism.
>
> Cheers,
>   Chris