[general discussion, moved from TIKA-7]


chrismattmann

>> I think that these questions need to be answered before we move forward with
>> more code development.

>> I disagree. I would prefer to have some concrete code in SVN, and I think the
>> stuff from Rida is a good starting point. Often it is much easier to discuss
>> design issues if you have concrete code that you can point to as an example. I
>> also much prefer an evolving codebase over a waterfall model where we first
>> design the "perfect" architecture and only then start implementing it.

I'm not sure I follow you here Jukka. I wasn't saying that we shouldn't have
code in SVN, simply, that we should properly design the way that the system
is going to work before we start moving code, "just to have code" within
SVN. I don't think everyone should just start dumping the sources into
Tika's SVN, and then we'll just have everyone sort it out moving forward.
I'm fine with having code for Tika, however, we at least need to have:

1. use cases for Tika (how does a user interact with it?)
2. generic interfaces and extension points that will support these use cases
3. implementations of those interfaces and concrete classes

We have a few cases for item #1, however, there are no specs for #2 and #3,
which must come at least during this time when new code is getting attached,
no? That's all I was calling for: a discussion of items #2 and #3, and
things like that, before we start moving code over and having Tika be a
warehouse for code from the 3 projects:

From the Tika proposal:

"No existing codebase is selected as "the" starting point of Tika to avoid
inheriting the world view and design limitations of any single project. "

Am I off base here?

Cheers,
  Chris
 



Re: [general discussion, moved from TIKA-7]

Jukka Zitting
Hi,

On 6/13/07, Chris Mattmann <[hidden email]> wrote:
> I'm not sure I follow you here Jukka. I wasn't saying that we shouldn't have
> code in SVN, simply, that we should properly design the way that the system
> is going to work before we start moving code, "just to have code" within
> SVN.

See my other message on this. I'm a bit concerned about our ability to
have a productive "pure" design discussion without at least some code
to base it on. We've already had a few design threads, but each seems
to have died with no real conclusions. I believe that having some
concrete code that people can play with will have a positive impact
also on higher level discussions.

> I'm fine with having code for Tika, however, we at least need to have:
>
> 1. use cases for Tika (how does a user interact with it?)
> 2. generic interfaces and extension points that will support these use cases
> 3. implementations of those interfaces and concrete classes
>
> We have a few cases for item #1, however, there are no specs for #2 and #3,
> which must come at least during this time when new code is getting attached,
> no?

No. :-) Having a shared area where we can prototype and discuss
alternatives (I regard code as another means of communication) is
quite valuable when coming up with answers to the open design issues.
We can also always refactor, rewrite, or simply dump existing code if
and when needed since we aren't yet making any backwards compatibility
promises.

> From the Tika proposal:
>
> "No existing codebase is selected as "the" starting point of Tika to avoid
> inheriting the world view and design limitations of any single project. "
>
> Am I off base here?

I very much agree with that statement, and I don't think we are
breaking it here. I think it's quite clear to everyone that the code
we have now (and will have for the months to come) is an early draft
that can and will be dropped if needed. I also quite like the way Rida
has started merging code from both Lius and Nutch.

BR,

Jukka Zitting

Re: [general discussion, moved from TIKA-7]

Bertrand Delacretaz
FWIW, I agree with what both Chris and Jukka said in this thread ;-)

We do need a good design before making an "API-stable" release.

At the same time, prototype code is a very good way of communicating
on possible designs, and trying them live for those who like that.

So I'm +1 on committing code early, and +1 on not considering this
code as final in any way.

-Bertrand

Re: [general discussion, moved from TIKA-7]

chrismattmann
In reply to this post by Jukka Zitting

> See my other message on this. I'm a bit concerned about our ability to
> have a productive "pure" design discussion without at least some code
> to base it on. We've already had a few design threads, but each seems
> to have died with no real conclusions. I believe that having some
> concrete code that people can play with will have a positive impact
> also on higher level discussions.

I completely disagree with this. You're saying, "we've tried to have design
discussions, and no one replied, so rather than attacking that issue, we're
going to just move ahead to prototyping." Screeech. I don't think that's the
right approach at all. We need to revisit the design discussions, otherwise
Rida (BTW, I'm not picking on you, just using you as an example) will start
checking in Lius code, Chris will start checking in Nutch code, Bertrand will
start checking in code for Apache project XX, and Doug will jump in and
commit some code he wrote to handle parsing from 10 years ago, and what
will we be left with? One huge mess.

The solution to not getting a response on the design discussion is to
properly vet it on the mailing list again, track people down, those who were
interested in the project, those folks who should care, throw darts at them,
get them back to the mailing lists, and discuss discuss discuss :).
Admittedly I haven't really been participating in the discussions on the
mailing list until recently, but I'm here now, and seemingly from the
response today, so are a lot of people. So, I don't agree with your
strategy. No disrespect, just don't agree.

>
>> I'm fine with having code for Tika, however, we at least need to have:
>>
>> 1. use cases for Tika (how does a user interact with it?)
>> 2. generic interfaces and extension points that will support these use cases
>> 3. implementations of those interfaces and concrete classes
>>
>> We have a few cases for item #1, however, there are no specs for #2 and #3,
>> which must come at least during this time when new code is getting attached,
>> no?
>
> No. :-) Having a shared area where we can prototype and discuss
> alternatives (I regard code as another means of communication) is
> quite valuable when coming up with answers to the open design issues.
> We can also always refactor, rewrite, or simply dump existing code if
> and when needed since we aren't yet making any backwards compatibility
> promises.

Code committed to the trunk should be reasonably high quality, and should
conform to standard interfaces and exchange standard data structures that we
come up with. It shouldn't be a hodgepodge holding area where code gets
dumped, thrown away, dumped back, etc. In my (admittedly) short experience
as an Apache developer, and (admittedly) *long* experience as a developer
for a large organization, CM shouldn't be treated like a file system. It
certainly shouldn't have immeasurably strict rules on it, but also, it
shouldn't just be our internet-based zip drive either.

>
>> From the Tika proposal:
>>
>> "No existing codebase is selected as "the" starting point of Tika to avoid
>> inheriting the world view and design limitations of any single project. "
>>
>> Am I off base here?
>
> I very much agree with that statement, and I don't think we are
> breaking it here. I think it's quite clear to everyone that the code
> we have now (and will have for the months to come) is an early draft
> that can and will be dropped if needed. I also quite like the way Rida
> has started merging code from both Lius and Nutch.

I guess I need to review the patch for TIKA-7 more and see what's there. I
will do that, and then comment on this further. Again, I'm not trying to be
argumentative, just trying to get my point across. I don't want a bad
precedent to be started here, because I don't think that the project will
live long if we adopt that strategy.

Cheers,
  Chris


>
> BR,
>
> Jukka Zitting



Re: [general discussion, moved from TIKA-7]

Bertrand Delacretaz
On 6/14/07, Chris Mattmann <[hidden email]> wrote:

> ...Code committed to the trunk should be reasonably high quality, and should
> conform to standard interfaces and exchange standard data structures that we
> come up with. It shouldn't be a hodgepodge holding area where code gets
> dumped, thrown away, dumped back,....

I tend to disagree...or maybe I agree if by "trunk" you mean an area
for production-level code only, that will be passed on to the world.

If that's the case, then we also need a "sandbox" to play with, where
code can indeed be dumped for others to see, play with, tear apart,
and steal ideas from. If you ask me, a "sandbox" subdirectory would be
just fine for this, if we want to make it very clear what to expect in
there.

But I'm with Jukka in that we need code in there rather sooner than
later to move forward and to base our design discussions on concrete
stuff.

-Bertrand

Re: [general discussion, moved from TIKA-7]

Jukka Zitting
In reply to this post by chrismattmann
Hi,

On 6/14/07, Chris Mattmann <[hidden email]> wrote:
> Code committed to the trunk should be reasonably high quality, and should
> conform to standard interfaces and exchange standard data structures that we
> come up with. It shouldn't be a hodgepodge holding area where code gets
> dumped, thrown away, dumped back, etc.

Would it work for you if we started a separate sandbox where people
can put their prototype code up for review? That way we could keep the
trunk "clean" and only move code there when we have a good consensus
on the design.

At least I find it much easier to work on stuff that I can actually
checkout, build, and track changes on. Also, even though we don't yet
have the final interfaces and overall architecture in place, once the
code is in svn we can start parallel efforts in for example comparing
different parser libraries, setting up test suites, etc.

BR,

Jukka Zitting

Re: [general discussion, moved from TIKA-7]

Jukka Zitting
In reply to this post by chrismattmann
Hi,

On 6/14/07, Chris Mattmann <[hidden email]> wrote:
> I completely disagree with this. You're saying, "we've tried to have design
> discussions, and no one replied, so rather than attacking that issue, we're
> going to just move ahead to prototyping."

I regard prototyping as primarily a design tool, so in my mind I *am*
attacking the issue at hand. :-)

But yes, I see your point and I do appreciate it. Let's find a way
forward that works for everyone.

BR,

Jukka Zitting

Re: [general discussion, moved from TIKA-7]

chrismattmann
In reply to this post by Jukka Zitting
Hi Jukka,

> Would it work for you if we started a separate sandbox where people
> can put their prototype code up for review? That way we could keep the
> trunk "clean" and only move code there when we have a good consensus
> on the design.

To me, this "sandbox" that you guys are talking about is essentially
equivalent to a JIRA issue, and attaching patch files there. JIRA is a
pretty easy way of sharing code, asking for comments/reviews/updates, etc.,
before it gets into the sources (i.e., the trunk). It also is a way for us
to track design discussions, and tie them to actual contributions into the
sources.

>
> At least I find it much easier to work on stuff that I can actually
> checkout, build, and track changes on. Also, even though we don't yet
> have the final interfaces and overall architecture in place, once the
> code is in svn we can start parallel efforts in for example comparing
> different parser libraries, setting up test suites, etc.

Sure, I hear ya, and maybe I'm adopting too strict an interpretation of CM.
I'm not against moving forward with the creation of this "sandbox" area, and
certainly not against moving forward with development of code in general for
the project.

I am going to take a look at TIKA-7 today or tomorrow, and I'll get back to
you guys with my comments on it, and perhaps we can talk a bit about the
overall design of the Tika parsing architecture at that time.

Thanks!

Cheers,
  Chris


>
> BR,
>
> Jukka Zitting



Re: [general discussion, moved from TIKA-7]

Michael Busch
Chris Mattmann wrote:

>
> To me, this "sandbox" that you guys are talking about is essentially
> equivalent to a JIRA issue, and attaching patch files there. JIRA is a
> pretty easy way of sharing code, asking for comments/reviews/updates, etc.,
> before it gets into the sources (i.e., the trunk). It also is a way for us
> to track design discussions, and tie them to actual contributions into the
> sources.
>
>
>  

I think that a sandbox has advantages here compared to JIRA. Yes, JIRA is
good to submit patches and to comment on issues. But it is quite unsuitable
for evolving code. Take for example TIKA-7: it's a huge patch. Now a couple
of people suggest changes and submit more patches. Then you end up having
lots of files attached to that issue, maybe even conflicting ones. IMO it's
easier to have such a big piece of code in the sandbox. Then people can
open different issues and submit patches based on the sandbox code. Those
patches will then be much easier to apply and test on your local checkout.

- Michael


Re: [general discussion, moved from TIKA-7]

Sami Siren-2
Michael Busch wrote:

> Chris Mattmann wrote:
>>
> I think that a sandbox has advantages here compared to JIRA. Yes, JIRA is
> good to submit patches and to comment on issues. But it is quite unsuitable
> for evolving code. Take for example TIKA-7: it's a huge patch. Now a couple
> of people suggest changes and submit more patches. Then you end up having
> lots of files attached to that issue, maybe even conflicting ones. IMO it's
> easier to have such a big piece of code in the sandbox. Then people can
> open different issues and submit patches based on the sandbox code. Those
> patches will then be much easier to apply and test on your local checkout.

I agree with Michael here and also because there currently exists
nothing worth protecting (no release = no requirement for any backward
compatibility) in trunk it is IMO no more than one big sandbox there.

--
 Sami Siren

Re: [general discussion, moved from TIKA-7]

chrismattmann

>> I think that a sandbox has advantages here compared to JIRA. Yes, JIRA is
>> good to submit patches and to comment on issues. But it is quite unsuitable
>> for evolving code. Take for example TIKA-7: it's a huge patch. Now a couple
>> of people suggest changes and submit more patches. Then you end up having
>> lots of files attached to that issue, maybe even conflicting ones. IMO it's
>> easier to have such a big piece of code in the sandbox. Then people can
>> open different issues and submit patches based on the sandbox code. Those
>> patches will then be much easier to apply and test on your local checkout.
>
> I agree with Michael here and also because there currently exists
> nothing worth protecting (no release = no requirement for any backward
> compatibility) in trunk it is IMO no more than one big sandbox there.

I agree as well. Since there are no releases, etc., let's continue to
develop in the trunk. However, my call for design discussions still remains
and still is necessary IMO. I will take the lead on beginning these
discussions soon by looking through and commenting on TIKA-7.

Thanks for everyone's feedback.

Cheers,
  Chris

______________________________________________
Chris A. Mattmann
[hidden email]
Key Staff Member
Modeling and Data Management Systems Section (387)
Data Management Systems and Technologies Group

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                        Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.