How to Become a Nutch Developer

How to Become a Nutch Developer

Dennis Kubes
All,

I am working on a "How to Become a Nutch Developer" document for the
wiki and I need some input.

I need an overview of how the JIRA process works.  If I am a
developer new to Nutch and just starting to look at the JIRA and I want
to start working on some piece of functionality or to help with bug
fixes, where would I look?

Would I just choose something that is unscheduled and begin working on it?

What if I see something that I want to work on but it is scheduled to
somebody else?

Are items only scheduled to committers or can they be scheduled to
developers as well?  If they can be scheduled to regular developers how
does someone get their name on the list to be scheduled items?

Should I submit a JIRA and/or notify the list before I start working on
something?  What is the common process for this?

When I submit a JIRA is there anything else I need to do either in the
JIRA system or with the mailing lists, committers, etc?

Getting this information together in one place will go a long way toward
helping others to start contributing more and more.  Thanks for all your
input.

Dennis Kubes

Re: How to Become a Nutch Developer

Andrzej Białecki-2
Dennis Kubes wrote:

> All,
>
> I am working on a "How to Become a Nutch Developer" document for the
> wiki and I need some input.
>
> I need an overview of how the JIRA process works.  If I am a
> developer new to Nutch and just starting to look at the JIRA and I
> want to start working on some piece of functionality or to help with
> bug fixes, where would I look?
>
> Would I just choose something that is unscheduled and begin working on
> it?

Well ... so far this process was very informal, because there were so
few key developers that they more or less knew what needs to be done,
and who is doing what.

Hadoop follows a much stricter and formalized model, which we could
adopt, since it apparently works well there. This should address the
issue of notifying others that the work is started on this or that item.

Regarding the picking of the work to be done - natural ordering in JIRA
should be followed, i.e. issues marked critical are more important than
"major", and the ones with a lot of votes are more important than those
without any.
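
A minimal sketch of that ordering, assuming priority is compared first and the vote count only breaks ties within the same priority (the thread does not say how the two should be weighed against each other). The `Issue` class, the issue keys, and the vote numbers are hypothetical; the priority names are the ones JIRA shows for the project.

```python
from dataclasses import dataclass

# Priority names as shown in JIRA, most important first.
PRIORITY_ORDER = ["Blocker", "Critical", "Major", "Minor", "Trivial"]


@dataclass
class Issue:
    key: str       # hypothetical issue key, e.g. "NUTCH-1"
    priority: str  # one of PRIORITY_ORDER
    votes: int     # number of votes on the issue


def work_order(issues):
    """Sort by priority first, then by vote count (most votes first).

    Assumption: votes are only a tie-breaker within one priority level.
    """
    return sorted(
        issues,
        key=lambda i: (PRIORITY_ORDER.index(i.priority), -i.votes),
    )


# Hypothetical sample data for illustration only.
issues = [
    Issue("NUTCH-1", "Major", 5),
    Issue("NUTCH-2", "Critical", 0),
    Issue("NUTCH-3", "Major", 12),
]
print([i.key for i in work_order(issues)])
```

Nothing stops a volunteer from picking an item further down the list; the ordering is only a hint about where help is most useful.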

And of course even if something is not that important, but there's some
kind soul who wants to work on it, we shouldn't discourage him.

>
> What if I see something that I want to work on but it is scheduled to
> somebody else?

You should always contact that person and coordinate the efforts. That's
only polite and sensible.

>
> Are items only scheduled to committers or can they be scheduled to
> developers as well?  If they can be scheduled to regular developers
> how does someone get their name on the list to be scheduled items?

I don't have any opinion on this, and I'm not sure how it works with
JIRA - are only committers eligible for JIRA accounts? I'm fine with a
non-committer developer working on patches, leaving just the final step
to one of the committers.

>
> Should I submit a JIRA and/or notify the list before I start working
> on something?  What is the common process for this?

See above for the process in Hadoop. Speaking for myself, when I start
working on something bigger that is not tracked in JIRA yet I usually
notify the list. If it's in JIRA I usually add a comment that I'm
working on a patch.

>
> When I submit a JIRA is there anything else I need to do either in the
> JIRA system or with the mailing lists, committers, etc?

I think that using proper tags in JIRA (which release, which subsystem,
environment etc) goes a long way, and of course a patch helps a lot, too. :)

>
> Getting this information together in one place will go a long way
> toward helping others to start contributing more and more.  Thanks for
> all your input.

Thanks for taking this initiative!

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to Become a Nutch Developer

chrismattmann
In reply to this post by Dennis Kubes
Hi Dennis,


On 1/21/07 11:47 AM, "Dennis Kubes" <[hidden email]> wrote:

> All,
>
> I am working on a "How to Become a Nutch Developer" document for the
> wiki and I need some input.
>
> I need an overview of how the JIRA process works.  If I am a
> developer new to Nutch and just starting to look at the JIRA and I want
> to start working on some piece of functionality or to help with bug
> fixes, where would I look?

JIRA provides a lot of search facilities: it's actually kind of nice. The
starting point for browsing bugs and other types of issues is:

http://issues.apache.org/jira/browse/NUTCH

(in general, for all Apache projects that use JIRA, you'll find that their
issue tracking system boils down to:

http://issues.apache.org/jira/browse/<APACHE_PROJ_JIRA_ID>
)
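
As a small illustration of that URL pattern, here is a sketch that builds the browse URL from a project key. The function name `browse_url` is made up for this example, and upper-casing the key is an assumption based on how the keys appear in the URLs above.

```python
BASE_URL = "http://issues.apache.org/jira/browse/"


def browse_url(project_key: str) -> str:
    """Return the JIRA browse URL for an Apache project key such as NUTCH."""
    key = project_key.strip().upper()  # assumption: keys are written in upper case
    if not key.isalnum():
        raise ValueError(f"not a plausible project key: {project_key!r}")
    return BASE_URL + key


print(browse_url("NUTCH"))
print(browse_url("hadoop"))
```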

From there, you can access canned filters for open issues like:
Blocker
Critical
Major
Minor
Trivial

For more detailed search capabilities, click on the "Find Issues" button in
the top breadcrumb bar. Search capabilities there include the ability to
look for issues by developer, status, and issue type, and to combine such
fields using AND and OR. Additionally, you can run a free text query across
all issues by using the free text box there.
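
To show what combining fields with AND and OR amounts to, here is a sketch that filters an in-memory list of issues. It does not call JIRA at all: the helper names (`field_is`, `all_of`, `any_of`), the field names, and the sample issues are all hypothetical and only mirror the kind of query you would build on the "Find Issues" page.

```python
def field_is(name, value):
    """Predicate: the issue's field `name` equals `value`."""
    return lambda issue: issue.get(name) == value


def all_of(*predicates):
    """AND: every predicate must match."""
    return lambda issue: all(p(issue) for p in predicates)


def any_of(*predicates):
    """OR: at least one predicate must match."""
    return lambda issue: any(p(issue) for p in predicates)


# Hypothetical sample data; real issues live in JIRA, not in a list.
issues = [
    {"key": "NUTCH-1", "status": "Open", "type": "Bug", "assignee": "alice"},
    {"key": "NUTCH-2", "status": "Resolved", "type": "Bug", "assignee": "bob"},
    {"key": "NUTCH-3", "status": "Open", "type": "Improvement", "assignee": "bob"},
    {"key": "NUTCH-4", "status": "Open", "type": "Task", "assignee": "alice"},
]

# status = Open AND (type = Bug OR type = Improvement)
query = all_of(
    field_is("status", "Open"),
    any_of(field_is("type", "Bug"), field_is("type", "Improvement")),
)
matches = [issue["key"] for issue in issues if query(issue)]
print(matches)
```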

>
> Would I just choose something that is unscheduled and begin working on it?

That's a good starting point: additionally, high priority issues marked as
"Blockers", "Critical" and "Major" are always good because the sooner we
(the committers) get a patch for those, the sooner we'll be testing it for
inclusion into the sources.

>
> What if I see something that I want to work on but it is scheduled to
> somebody else?

Walk five paces opposite your opponent: turn, then sho...err, wait. Nah, you
don't have to do that. ;) Just speak up on the mailing list, and volunteer
your support. One of the people listed in the group "nutch-developers" in
JIRA (e.g., the committers) can reassign the issue to you so long as the
other gent it was assigned to doesn't mind...

>
> Are items only scheduled to committers or can they be scheduled to
> developers as well?  If they can be scheduled to regular developers how
> does someone get their name on the list to be scheduled items?

Items can be scheduled to folks listed in the nutch-developers group within
JIRA. Most of these folks are committers, but not all of them are. I'm not
entirely sure how folks get into that group (maybe Doug?), but membership is
the real criterion for having a JIRA issue officially assigned to you.
That doesn't mean that you can't work on things without it, though. If
there's an issue that you'd like to contribute to, please prepare a patch,
attach it to JIRA, and then speak up on the mailing list. Chances are that,
given the recent busy schedules of the committers (including myself, and
apart from Sami and Andrzej), the committers don't have time to prepare
patches for the issues assigned to them. If you contribute a great patch, a
committer will pick it up, test it, and apply it, and you'll get the same
effect as if the issue were directly assigned to you.
>
> Should I submit a JIRA and/or notify the list before I start working on
> something?  What is the common process for this?

Yup, that's pretty much it. Voice your desire to work on a particular task
on the nutch-dev list. Many of the developers on that list have been around
for a while now, and they know what's been discussed, and implemented
before.
>
> When I submit a JIRA is there anything else I need to do either in the
> JIRA system or with the mailing lists, committers, etc?

Nope: the nutch-dev list is automatically notified of all JIRA issue
submissions, and the committers (and the rest of the folks) will pick up on
this and act accordingly.

>
> Getting this information together in one place will go a long way toward
> helping others to start contributing more and more.  Thanks for all your
> input.

No probs, glad to be of service :-)

Cheers,
  Chris

>
> Dennis Kubes



Re: How to Become a Nutch Developer

Zaheed Haque
In reply to this post by Andrzej Białecki-2
On 1/21/07, Andrzej Bialecki <[hidden email]> wrote:

>
> Well ... so far this process was very informal, because there were so
> few key developers that they more or less knew what needs to be done,
> and who is doing what.
>
> Hadoop follows a much stricter and formalized model, which we could
> adopt, since it apparently works well there. This should address the
> issue of notifying others that the work is started on this or that item.

My 2 cents :-) .. I like the way the Hadoop guys work! It is strict, but to my
mind being structured/rigid brings more benefit for the newbie developer,
because you can follow every issue from start to end, with all the comments in
between. I have noticed that some of the mailing list questions/answers related
to issues, for example, are not in the Nutch JIRA, so to follow an issue you
have to go back and forth between the mailing list and JIRA.

IMHO Nutch should adopt the Hadoop model. Furthermore, it's probably a good
idea to discuss it further, because Nutch will soon have a 0.9 release and this
is probably a good time to change to Hadoop style :-)

Just some thoughts.

Cheers

Re: How to Become a Nutch Developer

Dennis Kubes
Zaheed Haque wrote:

> On 1/21/07, Andrzej Bialecki <[hidden email]> wrote:
>
>>
>> Well ... so far this process was very informal, because there were so
>> few key developers that they more or less knew what needs to be done,
>> and who is doing what.
>>
>> Hadoop follows a much stricter and formalized model, which we could
>> adopt, since it apparently works well there. This should address the
>> issue of notifying others that the work is started on this or that item.
>
> My 2 cents :-) .. I like the way the Hadoop guys work! It is strict, but to my
> mind being structured/rigid brings more benefit for the newbie developer,
> because you can follow every issue from start to end, with all the comments in
> between. I have noticed that some of the mailing list questions/answers related
> to issues, for example, are not in the Nutch JIRA, so to follow an issue you
> have to go back and forth between the mailing list and JIRA.

What does the Hadoop project do differently than Nutch?  I thought they
were both run about the same way.  Is it that all communication on
issues goes through the JIRA?
>
> IMHO Nutch should adopt the Hadoop model. Furthermore, it's probably a good
> idea to discuss it further, because Nutch will soon have a 0.9 release and this
> is probably a good time to change to Hadoop style :-)

I am for productivity and sometimes that requires change.  Better now
than later if it will help more people get involved and be productive.
>
> Just some thoughts.
>
> Cheers

Re: How to Become a Nutch Developer

Dennis Kubes
In reply to this post by chrismattmann
Thanks to everyone for the input.  I know some of these questions are
obvious but I wanted to take it from the lowest possible level.

Part of the document is already posted to the wiki here.

http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer

It seems like I am getting a section done each night, so everything
should be done in a couple of days.

Dennis Kubes


Re: How to Become a Nutch Developer

Andrzej Białecki-2
In reply to this post by Dennis Kubes
Dennis Kubes wrote:
> What does the Hadoop project do differently than Nutch?  I thought
> they were both run about the same way.  Is it that all communication
> on issues goes through the JIRA?

The workflow is different - I'm not sure about the details, perhaps Doug
can correct me if I'm wrong ... and yes, it uses JIRA extensively.

1. An issue is created
2. Patches are added, removed, commented on, etc.
3. finally, a candidate patch is selected, and the issue is marked
"Patch available".
4. An automated process applies the patch to a temporary copy, and
checks whether it compiles and passes junit tests.
5. A list of patches in this state is available, and committers may pick
from this list and apply them.
6. An explicit link is made between the issue and the change set
committed to svn (Is this automated?)
7. The issue is marked as "Resolved", but not closed. I believe issues
are closed only when a release is made, because issues in state
"resolved" make up the Changelog. I believe this is also automated.

--
Best regards,
Andrzej Bialecki     <><



Re: How to Become a Nutch Developer

Doug Cutting
Andrzej Bialecki wrote:
> The workflow is different - I'm not sure about the details, perhaps Doug
> can correct me if I'm wrong ... and yes, it uses JIRA extensively.
>
> 1. An issue is created
> 2. Patches are added, removed, commented on, etc.
> 3. finally, a candidate patch is selected, and the issue is marked
> "Patch available".

"Patch Available" is code for "the contributor now believes this is
ready to commit".  Once a patch is in this state, a committer reviews it
and either commits it or rejects it, changing the state of the issue
back to "Open".  The set of issues in "Patch Available" thus forms a
work queue for committers.  We try not to let a patch sit in this state
for more than a few days.

> 4. An automated process applies the patch to a temporary copy, and
> checks whether it compiles and passes junit tests.

This is currently hosted by Yahoo!, run by Nigel Daley, but it wouldn't
be hard to run this for Nutch on lucene.zones.apache.org, and I think
Nigel would probably gladly share his scripts.  This step saves
committers time: if a patch doesn't pass unit tests, or has javadoc
warnings, etc. this can be identified automatically.

> 5. A list of patches in this state is available, and committers may pick
> from this list and apply them.
> 6. An explicit link is made between the issue and the change set
> committed to svn (Is this automated?)

Jira does this based on commit messages.  Any bug ids mentioned in a
commit message create links from that bug to the revision in subversion.
Hadoop commit messages usually start with the bug id, e.g.,
"HADOOP-1234.  Remove a deadlock in the oscillation overthruster."
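
For illustration, a sketch of how issue ids could be pulled out of a commit message. The regular expression is an assumption about what an id looks like (an upper-case project key, a hyphen, and a number); the exact rule Jira applies is not stated in this thread, and `issue_keys` is a made-up helper name.

```python
import re

# Assumed shape of an issue id: upper-case project key, hyphen, number.
ISSUE_KEY = re.compile(r"\b([A-Z][A-Z0-9]*-\d+)\b")


def issue_keys(commit_message: str) -> list:
    """Return every issue id mentioned in a commit message, in order."""
    return ISSUE_KEY.findall(commit_message)


message = "HADOOP-1234.  Remove a deadlock in the oscillation overthruster."
print(issue_keys(message))
```

Starting the message with the id, as Hadoop does, also makes the commit log easy to scan by eye.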

> 7. The issue is marked as "Resolved", but not closed. I believe issues
> are closed only when a release is made, because issues in state
> "resolved" make up the Changelog. I believe this is also automated.

Jira will put resolved issues into the release notes regardless of
whether they're closed.  The reason we close issues on release is to
keep folks from re-opening them.  We want the release notes to be the
list of changes in a release, so we don't want folks re-opening issues
and having new commits made against them, since then the changes related
to the issue will span multiple releases.  If an issue is closed but
there's still a problem, a new issue should be created linking to the
prior issue, so that the new issue can be scheduled and tracked without
modifying what should be a read-only release.
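
The issue lifecycle described in this thread can be sketched as a small state table. The status names come from the thread; the action names (`submit_patch`, `reject`, `commit`, `reopen`, `release`) are hypothetical labels chosen for this example, and the `reopen` transition is only implied by the remark that closing exists to prevent re-opening.

```python
# (current status, action) -> next status
TRANSITIONS = {
    ("Open", "submit_patch"): "Patch Available",
    ("Patch Available", "reject"): "Open",       # committer sends it back
    ("Patch Available", "commit"): "Resolved",   # committer applies the patch
    ("Resolved", "reopen"): "Open",              # implied: possible until release
    ("Resolved", "release"): "Closed",           # closed when a release is made
    # "Closed" has no outgoing transitions: file a new, linked issue instead.
}


def next_status(status: str, action: str) -> str:
    """Return the status after `action`, or raise if the move is not allowed."""
    try:
        return TRANSITIONS[(status, action)]
    except KeyError:
        raise ValueError(f"cannot {action!r} an issue that is {status!r}") from None


status = "Open"
for action in ["submit_patch", "reject", "submit_patch", "commit", "release"]:
    status = next_status(status, action)
print(status)
```

Leaving "Closed" without any outgoing transition is what keeps the release notes for a shipped release read-only.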

Doug



Re: How to Become a Nutch Developer

Dennis Kubes
+1 for adopting the same types of process with Nutch.


Re: How to Become a Nutch Developer

Dennis Kubes
In reply to this post by Doug Cutting
Doug

Can you answer the question of how to add developer names to JIRA or if
that is only for committers?

Dennis


Re: How to Become a Nutch Developer

Doug Cutting
Dennis Kubes wrote:
> Can you answer the question of how to add developer names to JIRA or if
> that is only for committers?

It's not just for committers, but also for regular contributors.  I have
added you.  Anyone else?

Doug