Nutch doesn't support Korean?

Re: project vitality?

Byron Miller-2
I like to think of it as a framework. Building blocks
to build what you ultimately need.

If you're after the one-stop shop, plug and play, no
development necessary, then perhaps some other
commercial systems may be your best bet.

The mailing list is very active, and most people get responses
fairly quickly. If a question is ignored, it's often
because it's already been answered.

To really understand Nutch you need to understand
Lucene, Hadoop and search in general, and the wikis of
both Lucene and Nutch are a great read.

If all of this is above one's head or not within your
time frame to bother with then, like I said, there are
other products out there.

Other than that, I'm happily running Nutch, looking
forward to a billion+ page index and enjoying picking
the brains of the talent pool we have here.

Happy nutcher

-byron
http://www.mozdex.com


--- Matt Wilkie <[hidden email]> wrote:

> Hi there, I'm new around here. The mailing lists seem to have a pretty
> steady stream of traffic but the website hasn't been updated since
> August, and there's only a handful of news items before that. What is
> the vitality of the Nutch project? Is it basically a laboratory proof of
> concept or a mature, ready-for-production product?
>
> thanks for your time,
>
> --
> matt wilkie
> --------------------------------------------
> Geographic Information,
> Information Management and Technology,
> Yukon Department of Environment
> 10 Burns Road * Whitehorse, Yukon * Y1A 4Y9
> 867-667-8133 Tel * 867-393-7003 Fax
> http://environmentyukon.gov.yk.ca/geomatics/
> --------------------------------------------
>
>


Re: [Nutch-general] Re: project vitality?

Greg Boulter
In reply to this post by Stefan Groschupf-2
Hi,

I think that this is my first post. I follow the mailing list and read as
many of the emails as I can.

I'm going to make a few proposals.
I have obtained some money to spend on them.
I use and get paid for my nutch expertise.
I have some experience.
I don't just speak for myself but also for some people who use nutch now,
have a commercial interest in nutch and who will contribute money to the
effort.
This money is not a great deal but it could both escalate and become
ongoing.
I sympathize with the people who are (with no offense to any "side", if
there really is one) the "complainers".
I am grateful to the coders.
I can and do make code improvements to nutch for my own uses that nobody
ever sees.
I have a web interface (sort of), and many other tools that work with nutch,
from maps to communication with nutch via telephone.
I expect to gain from my association with nutch although how I can't really
put my finger on yet.
I wouldn't say that I'm frustrated - I'd describe it more as a feeling of
hope mixed with helplessness and despair.
I think the moment is almost gone.
I'm old and scatterbrained and don't spell check or reread before I post.
I will elaborate as soon as I see this on the list - but I don't like to
type until I know what I have to deal with, I have about 3000 emails a day
to sift through and I have so many email addresses I've signed up for that I
never really know whether I'm going to hit the wrong list or something or
whatever.

Greg.

RE: project vitality?

David Wallace-3-2
In reply to this post by Matt Wilkie-2
Hello all,
I think Nutch is a fantastic product.  I used 0.6 initially, then 0.7.
My 0.7 installation is in production, and mostly works really well.  I
haven't made the move to 0.8 yet, because the direction that Nutch has
gone for 0.8 is quite different from what my organisation requires from
its search engine.
 
I owe Doug and the team a huge thank-you for all the effort they've put
into Nutch.  Well done.
 
However, it's a sad day when someone like Richard Braman gets shot down
in flames for making some fair and valid criticisms of the Nutch
project.  Apart from his statement about Nutch being in "proof of
concept" stage, I agree with everything Richard has said.  The
documentation DOES leave a fair bit to be desired.  The initial learning
curve CAN be precipitous.  It's easy to get confused with all the
various settings in the XML configuration files and the various
plug-ins.  I can understand that he doesn't feel that he's in a position
to contribute to the documentation base, because he doesn't know all the
answers yet.
 
I think moving everything, including the tutorial, to the Wiki is a
fine idea; provided that we encourage new users to comment on what did
and didn't work for them.  I think we'll find there's a lot of common
ground among their comments.  Long-term readers of the nutch-user
mailing list know that many newbies ask the same questions.  
 
Also, I've lost count of the number of times someone has posted
something to the effect of "I'll pay someone to give me Nutch support",
simply because they find the existing documentation and mailing lists
inadequate.  Usually, that person gets told that the best way to get
Nutch support is to ask questions on the mailing list; but since
questions often go unanswered, this isn't a very good way to get Nutch
support at all.
 
All of this is acceptable in a product that hasn't yet reached "version
1.0".  The code has moved ahead faster than the documentation; and
that's fine, provided the documentation will eventually catch up.
Maybe, once 0.8 is deemed production-worthy, the team should down tools,
stop coding, and put some effort into really producing a really lovely
set of documentation, including a comprehensive FAQ.  I believe that
this will help grow the user base, faster than adding new features ever
could.
 
So in summary, well done to the Nutch team for this great product.
Well done to Richard Braman for pointing out what could be done.  And
let's all not flame people whose opinions differ from our own.
 
David.

********************************************************************************
This email may contain legally privileged information and is intended only for the addressee. It is not necessarily the official view or
communication of the New Zealand Qualifications Authority. If you are not the intended recipient you must not use, disclose, copy or distribute this email or
information in it. If you have received this email in error, please contact the sender immediately. NZQA does not accept any liability for changes made to this email or attachments after sending by NZQA.

All emails have been scanned for viruses and content by MailMarshal.
NZQA reserves the right to monitor all email communications through its network.

********************************************************************************

Re: project vitality?

Chris Lamprecht
In reply to this post by Byron Miller-2
I think of the Nutch project as a marathon, not a sprint.  Nutch's
stated goals include:

* Scale to entire web
- pages on millions of different servers
- billions of pages
* Support high traffic
- thousands of searches per second
* State-of-the-art search quality

(see http://wiki.apache.org/nutch/Presentations)

It's inspiring to see a project with such ambitious goals become a reality.


On 3/5/06, Byron Miller <[hidden email]> wrote:

> I like to think of it as a framework. Building blocks
> to build what you ultimately need.

RE: project vitality?

Richard Braman
In reply to this post by David Wallace-3-2
I think "Proof of concept" means different things to different people.
I am sorry I ever used those words, aside from the possible benefit of
getting people a little "fired up", which may bring about some needed
changes.  It is, more fairly, something like beta.

I don't take anything anyone says on a mailing list concerning my
opinions personally (say something about my mama, maybe a different
story :) ).  I just want to see nutch get more users, and that goal
requires those with the knowledge to answer the questions and make sure
nutch is relatively easy to use.  

I also want to see nutch:

Scale to the entire web, and
have State-of-the-art search quality

I think those 2 goals require better tactics than PDFBox's
stripper.getText();

I will take on the challenge myself and gladly share the developments
back with the community.  Anyone interested in joining in can contact me
and their contributions will be welcome.

I do think putting FAQs and tutorials on the wiki is much better than
having to go on an easter egg hunt through the mailing list archives,
which is how I solved many of my problems.  It was only when I couldn't
make sense of what I read that I posted a question.

I would also be happy to edit Stefan's wiki for English grammar as he
indicated his English was "terrible", which may be a little overstated.
That's the least I can do for the help he has given me so far.



-----Original Message-----
From: David Wallace [mailto:[hidden email]]
Sent: Sunday, March 05, 2006 5:39 PM
To: [hidden email]
Subject: RE: project vitality?


Hello all,
I think Nutch is a fantastic product.  I used 0.6 initially, then 0.7.
My 0.7 installation is in production, and mostly works really well.  I
haven't made the move to 0.8 yet, because the direction that Nutch has
gone for 0.8 is quite different from what my organisation requires from
its search engine.
 

Re: [Nutch-general] Re: project vitality?

Greg Boulter
In reply to this post by Greg Boulter
Hello again.

OK - first of all I hate mailing lists. I don't consider them to be a valid
form of communication for anything but the people doing the coding and don't
really consider them of much use at all unless there is no other
alternative. Except one - and that is when there needs to be something
communicated to the people doing the work and it has to get through - in
other words I think mailing lists are a last resort.

I've been a part of a few areas of the net where what I was involved with
just took off. One of them was in 1999 when Flash 4 came out and suddenly
anyone with an ability to use Flash was hot and Flash was the big news and I
was part of a forum called "were-here.com" which was the "adult" flash forum
as opposed to the kids' "flashkit.com" site. My name was/is Mapp and for the
most part of were-here's life I was moderator of the XML forum. I think that
if anyone has or cares to read my posts they'll see that I always try to
help, my help was usually complete, I am always polite. We had quite a ride
for awhile but then the owners of the forum for some secretive reason just
took the site down leaving the thousands of contributing posters "homeless".
I still keep up with all the XML stuff and I suppose I must be sort of an
expert in XML - at least in knowing the different formats, vxml, aiml, on
and on.

I was also part of a few areas of the net where it looked like things were
going to take off and never did. One thing I noticed is that technologies
that take off have forums dedicated to them and ones that don't take off
resist going off the mailing list.

I like it how people say "take it off list" but oh where should it be taken
to please? Nobody says "take the discussion to the wiki" because
traditionally wikis aren't real discussion areas. What really should be said
is "take it to the forum" but there isn't really one is there? If there is
nobody says anything. I have the name "nutchforum.com" and am #1 in MSN,
Google and Yahoo and one person posted there one day. I know there are other
efforts too but if they have any good discussions about relevant topics I'm
unaware of them.

I agree that the people doing the coding shouldn't have to read this and so
obviously I'm proposing a nutch forum with myself for example (could be
others too) as a moderator. At least I have a history and it is decent.
Were-here.com is back up now - bought by a corporation and maintained as a
learning resource to the Flash community  but I don't post there much and
that is because I resented my hundreds if not thousands of hours of
painstakingly trying to give back to "the community" by being complete,
coherent, etc lost because whoever happened to have the "luck" of owning the
forum decided that oh well, see you around, I'm going to work for Microsoft,
or whatever. I still resent it even if some corporation knew that they could
garner enough good will by buying the forum and restoring the posts/knowledge
base.

So, what I've done is pick "Moodle" - an open source php learning system,
which has a forum and I've decided that I'll attempt to start a useful forum
and that what I'll do is every week or two make the forum SQL dump available
so if I ever decide that I don't care about anyone or I get snapped up by
Google any knowledge will live on. Moodle is being developed by teachers, the
people I'd trust to do things right (except for librarians - check out the
open source library software that librarians write for an example of a
dominant open source effort). So I assume that any forum posting will be
long-lived and "free".

I've also decided to pay for posts - the surest way for a forum/community to
not get started is by there being no posting activity. So, I arranged to get
posts paid for. I'm not sure yet how much is reasonable but I started off
figuring that a few dollars for a well thought out question and 20-100
dollars for a reasonably comprehensive answer might be alright. Also, I've
arranged for some hosting space for people who want to make search engines
but don't have the resources. I have a few dedicated servers and unique IP
addresses and the like for people who will share their experiences. I don't
know what is reasonable to pay but I have arranged some funding and
resources albeit with conditions.

Also there are other things that normally cost money as well as I'll give
support to people who want to use the "web interface" that I've been working
on and if somebody else has an idea that needs a little money well right now
the people that I've set up with older not so up to date nutch search
engines are becoming desperate to get the stuff I told them would be
available to them. These aren't people who want billion page indexes spread
over 10 separate beowulf clusters - they're just people who thought they
could spend a few hundred and get some additional functionality out of open
source software. That being what I do mostly, set up and integrate open
source software for people who have reasonable goals. I'm old now and not as
competitive as I once might have been.

Anyway, I agree this discussion should go off list - if anyone cares to go
to http://www.nutchforum.com I will discuss/help/be helped there. Thanks
again to the people who work on nutch.

Greg.

RE: [Nutch-general] Re: project vitality?

Richard Braman
I'll take part in your forum. Just added first post.

-----Original Message-----
From: Greg Boulter [mailto:[hidden email]]
Sent: Sunday, March 05, 2006 6:33 PM
To: [hidden email]
Subject: Re: [Nutch-general] Re: project vitality?


Hello again.

OK - first of all I hate mailing lists. I don't consider them to be a
valid form of communication for anything but the people doing the coding
and don't really consider them of much use at all unless there is no
other alternative. Except one - and that is when there needs to be
something communicated to the people doing the work and it has to get
through - in other words I think mailing lists are a last resort.


Re: project vitality?

Thomas Delnoij-3
In reply to this post by Stefan Groschupf-2
Stefan.

> I know people having >500 mio pages index and I personal run crawls with
> ~300 pages per second.

Sorry, but I have to ask: what kind of setup do you have (network, hw, nutch
version) that you manage so many pages per second?

Unless this is a "company secret", it would be very nice to know how you
manage this.

Rgrds, Thomas

Re: project vitality?

Stefan Groschupf-2
Hi Thomas,
for this crawl setup we have a test environment of Nutch 0.8,
10xAMD's, custom Linux build, 100Mbit eth1, 1Gb eth0; each box has a
'caching' DNS server.
Stefan
On 06.03.2006 at 15:59, TDLN wrote:

> Stefan.
>
>> I know people having >500 mio pages index and I personal run crawls
>> with ~300 pages per second.
>
> Sorry, but I have to ask: what kind of setup do you have (network, hw,
> nutch version) that you manage so many pages per second?
>
> Unless this is a "company secret", it would be very nice to know how
> you manage this.
>
> Rgrds, Thomas

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: project vitality?

mos-2
On 3/4/06, Stefan Groschupf:

> Just a general note, jira has a voting functionality.
> This allows everybody to vote an issue and can show in a very
> compressed style what the community is looking for.
> However it is not used that often yet. It would be great if more
> users can use it.

That's a good suggestion.
I want to do some advertising for my favorite. ;)
Because there is a bug in Nutch 0.7.1 which forces me to make
complete recrawls instead of using the incremental approach, this is my
voting recommendation:
http://issues.apache.org/jira/browse/NUTCH-205

By the way:
I totally agree with the opinions exchanged.

- Nutch is a great project and has the chance to become a very, very
popular and robust piece of open source software. A big thank-you to all
Nutch developers is more than appropriate:
Thanks guys!

- On the other hand: as Richard wrote, there could be some improvements
in documentation and in responses to the mailing list and reported
Jira issues.

My concrete suggestions:

Nutch 0.8 should be available in around the next two months. Let's
take the chance and improve the (wiki) documentation before releasing it.
First, let's specify what kind of documentation we would like to have in 0.8.
I'm sure we'll get volunteers for every documentation subject to
write it down, and some more volunteers for checking and testing it.

I would like to support the documentation project in the next weeks
(as far as my spare time allows ;))

move from nutch 0.71 to 0.8

waterwheel
I've seen it noted that a complete recrawl is necessary to migrate from
0.71 to 0.8.  Is this absolutely necessary?  Or could a converter be
created to migrate the data?  Has anyone created this?

I expect at some point I'll have to move versions and something like
this would be very useful.  If it's not been done yet and is possible,
we'll likely tackle it at some point.



Re: move from nutch 0.71 to 0.8

Andrzej Białecki-2
Insurance Squared Inc. wrote:
> I've seen it noted that a complete recrawl is necessary to migrate
> from 0.71 to 0.8.  Is this absolutely necessary?  Or could a converter
> be created to migrate the data?  Has anyone created this?
> I expect at some point I'll have to move versions and something like
> this would be very useful.  If it's not been done yet and is possible,
> we'll likely tackle it at some point.

Yes, it should be possible to write a converter, but it's a lot of work
... You would need to code conversion routines for many structures,
because essentially all data containers are different in these two versions.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Nutch doesn't support Korean?

T. Kuro Kurosaka
In reply to this post by T. Kuro Kurosaka
Thank you.  I filed a new bug, NUTCH-224.
http://issues.apache.org/jira/browse/NUTCH-224

> -----Original Message-----
> From: Cheolgoo Kang [mailto:[hidden email]]
> Sent: 2006-3-03 20:49
> To: [hidden email]
> Subject: Re: Nutch doesn't support Korean?
>
> Hello,
>
> There was a similar issue with Lucene's StandardTokenizer.jj.
>
> http://issues.apache.org/jira/browse/LUCENE-444
>
> and
>
> http://issues.apache.org/jira/browse/LUCENE-461
>
> I have almost no experience with Nutch, but you can handle it like
> those issues above.
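
When checking whether a tokenizer is dropping Korean text, it can help to first confirm which characters in a sample are Hangul at all. The following is a rough sketch only, not the fix from the linked issues and not Nutch or Lucene code; it relies on the Unicode Hangul Syllables block (U+AC00 to U+D7A3), and the helper name `contains_hangul` is made up for illustration.

```python
# Minimal check for Hangul syllables in a string.
# The range is the Unicode "Hangul Syllables" block, U+AC00..U+D7A3 inclusive.
# Hangul Jamo and compatibility blocks are deliberately not covered here.

HANGUL_SYLLABLES = range(0xAC00, 0xD7A4)


def contains_hangul(text: str) -> bool:
    """Return True if any character of text is a precomposed Hangul syllable."""
    return any(ord(ch) in HANGUL_SYLLABLES for ch in text)


if __name__ == "__main__":
    for sample in ("한국어", "Nutch"):
        print(sample, contains_hangul(sample))
```

Running a sample document through a check like this, and then through the analyzer, shows whether tokens containing these characters survive tokenization.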

Re: project vitality?

Doug Cutting
In reply to this post by Richard Braman
Richard Braman wrote:
> I really do think nutch is great, but I echo Matthias's comments that the
> community needs to come together and contribute more back.  And that
> comes with the requirement of making sure volunteers are given access to
> make their contributions part of the project.

Here's how it works:

One has to be a committer to directly change the code.

One may be invited to become a committer if one contributes a number of
non-trivial, consistently exemplary patches.

Exemplary patches:
  1. are easy for a committer to apply;
  2. fix one thing;
  3. fix it well;
  4. are well formatted, using Sun's coding conventions;
  5. are well documented, with Javadoc for all non-private items;
  6. pass all existing unit tests;
  7. include new unit tests;
  8. etc.

An exemplary patch is thus something that a committer can commit with
little hesitation.  It follows that exemplary patches will be committed
quickly.  Lesser patches are likely to languish.

For example, a committer might be reluctant to take on a poorly
constructed patch for a bug that only affects niche users, since it may
take a lot of time to turn it into code worthy of committing.

Most committers are already doing as much as they can to help the
project.  The trick is not to get the committers to do more work, but
for others to do more work for the committers, and, eventually, to get
more committers.

> Putting the faqs and tutorial on the website and not the wiki may be one
> of the two biggest problems in getting people started learning nutch.

If you think these should move, don't just complain: file a bug, make
your case, submit a patch, etc.  The website is part of the source and
is governed by the same process.

Doug

Re: project vitality?

Doug Cutting
In reply to this post by David Wallace-3-2
David Wallace wrote:
> Also, I've lost count of the number of times someone has posted
> something to the effect of "I'll pay someone to give me Nutch support",
> simply because they find the existing documentation and mailing lists
> inadequate.  Usually, that person gets told that the best way to get
> Nutch support is to ask questions on the mailing list; but since
> questions often go unanswered, this isn't a very good way to get Nutch
> support at all.

I agree this is a problem, but it is also an opportunity. I do try to
answer Nutch questions whenever I have time, and most other Nutch
developers are also active on these lists.  The problem is simply that
there are more questions than question answering hours.

> All of this is acceptable in a product that hasn't yet reached "version
> 1.0".  The code has moved ahead faster than the documentation; and
> that's fine, provided the documentation will eventually catch up.

Yes, I hope it will.

> Maybe, once 0.8 is deemed production-worthy, the team should down tools,
> stop coding, and put some effort into really producing a really lovely
> set of documentation, including a comprehensive FAQ.  I believe that
> this will help grow the user base, faster than adding new features ever
> could.

That would be nice.  Once things settle down it will also be easier for
support organizations, consultants, book authors, etc, to step in and
improve documentation too.

Doug

Re: project vitality?

Matt Wilkie-2
In reply to this post by Matt Wilkie-2
Thank you everyone for giving me a through-the-keyhole view of the Nutch
project. I really appreciate the time it takes to read messages and
compose a reply -- time which could otherwise be spent coding or
writing documentation. ;)

I am somewhat saddened, but unsurprised, to find a slightly antagonistic
relationship between the coders and the users. I've followed a number
of open source projects and a polarised dialogue seems to arise of its
own accord, not always mind you, but often. I'll depart with a
suggestion which in the past I've seen provide some lubrication and
make for a more easeful coder-user dialogue:

On the mailing list, initiate a "Summary" convention, wherein people who
ask questions are politely asked to summarise the results and post back
to the list. How it works:

Ms Newbie asks "how do I get Nutch to crawl my coffee pot archives?".
Half a dozen people reply with an assortment of tips. Some are terse,
"read faq 13.4", and some offer a little more hand holding, "first
connect the archive via caffeine plug 3a, then power up the filter
holder", and another adds "but don't forget to place receiving
receptacle A1:Final under the spout or your results will be all over the
floor". Ms Newbie then posts a message titled "SUM: crawling coffee pot
archives" back to the mailing list summarising the suggestions and her
results.

The "SUM" or "Summary:" part is important for people searching the
archives. They want to start with the results before crawling back
through the initiating questions.

Until the convention has been used enough to become natural, it will be
necessary to *politely* remind/ask questioners to summarise.

There will always be some who just won't or can't summarise. Don't waste
time chewing them out for it, that just adds noise. After a suitable
interval ask them nicely once or twice to summarise, if there is still
nothing forthcoming simply stop responding to their questions in an
informative way.

Lead by example. It will take some time for the custom to gel. A small
handful will need to resign themselves to going it alone for awhile. It
won't be forever, people know a good thing when they see it (eventually!).

Keep an eye on the SUMs and periodically grab the juicy ones and
reformat for the wiki and/or documentation.

Summarisers: Please don't just concatenate the replies into one big
verbatim message -- we can read the mailing list for full details! Keep
only the core info which really helps. Strip out signatures, anecdotes,
chatter, unneeded controversy and anything else which doesn't answer the
question. Also, always credit the people who've taken time out of
*their* work to help you with *yours*. P:-)

I've run out of time today; tomorrow I'll SUMmarise this thread to show
more concretely what I mean. After that, well, if it works use it, if
not leave it to collect dust in the bit bucket and move on.

cheers,

--
matt wilkie
--------------------------------------------
Geographic Information,
Information Management and Technology,
Yukon Department of Environment
10 Burns Road * Whitehorse, Yukon * Y1A 4Y9
867-667-8133 Tel * 867-393-7003 Fax
http://environmentyukon.gov.yk.ca/geomatics/
--------------------------------------------
