Nutch doesn't support Korean?


Nutch doesn't support Korean?

T. Kuro Kurosaka
I was browsing NutchAnalysis.jj and found that
Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means
a Unicode character of the hex value xxxx) are not
part of the LETTER or CJK class.  It seems to me that
Nutch cannot handle Korean documents at all.
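
To make the range concrete, here is a minimal Java sketch. It is not taken from NutchAnalysis.jj; the class and method names are hypothetical. It tests a character against the explicit U+AC00 ... U+D7AF range quoted above and, for comparison, against the JDK's own `Character.UnicodeBlock` table.

```java
public class HangulCheck {
    // Bounds of the Hangul Syllables block as quoted in the post (U+AC00 ... U+D7AF).
    static final char HANGUL_FIRST = '\uAC00';
    static final char HANGUL_LAST = '\uD7AF';

    // Explicit range test, the same kind of check a grammar's character class encodes.
    public static boolean inHangulRange(char c) {
        return c >= HANGUL_FIRST && c <= HANGUL_LAST;
    }

    // The same question answered by the JDK's Unicode block table.
    public static boolean inHangulBlock(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    public static void main(String[] args) {
        // "Nutch" followed by two Hangul syllables (U+D55C, U+AE00).
        String sample = "Nutch \uD55C\uAE00";
        for (char c : sample.toCharArray()) {
            System.out.printf("U+%04X range=%b block=%b%n",
                    (int) c, inHangulRange(c), inHangulBlock(c));
        }
    }
}
```

A tokenizer whose letter classes omit this range would presumably treat such characters as separators, which is the concern raised here.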

Is anybody successfully using Nutch for Korean?

-kuro

project vitality?

Matt Wilkie-2
Hi there, I'm new around here. The mailing lists seem to have a pretty
steady stream of traffic but the website hasn't been updated since
August, and there's only a handful of news items before that. What is
the vitality of the Nutch project? Is it basically a laboratory proof of
concept or a mature, production-ready product?

thanks for your time,

--
matt wilkie
--------------------------------------------
Geographic Information,
Information Management and Technology,
Yukon Department of Environment
10 Burns Road * Whitehorse, Yukon * Y1A 4Y9
867-667-8133 Tel * 867-393-7003 Fax
http://environmentyukon.gov.yk.ca/geomatics/
--------------------------------------------


RE: project vitality?

Richard Braman
I think it is still very much at the proof-of-concept stage.  I think it is
close, but as you have mentioned, the website is severely out of date
and the information and documentation on it lacks luster.  I have tried
to get the tutorial and FAQs updated, but I haven't heard back.

-----Original Message-----
From: Matt Wilkie [mailto:[hidden email]]
Sent: Friday, March 03, 2006 6:34 PM
To: [hidden email]
Subject: project vitality?




RE: project vitality?

Howie Wang
I wouldn't call Nutch 0.7.x proof-of-concept. There are several
production sites running it already:

http://wiki.apache.org/nutch/PublicServers

Plus I think Technorati is built on Nutch and/or Lucene.

That said, the doc could be better, and it's probably a good idea
if you know Java since you might have to tweak the code a bit to
get the exact behavior you want.  If you don't have special needs,
you could get something like a site search up in very little time.

The newer versions seem to be changing a lot still though. I've
been waiting for the dust to settle before I see if I want to upgrade.

Howie




Re: project vitality?

gekkokid
It's past the concept stage. Technorati uses Lucene. In open source projects,
the last thing people want to do is documentation.

Does anybody know why Yahoo took down their Nutch server?


----- Original Message -----
From: "Howie Wang" <[hidden email]>
To: <[hidden email]>; <[hidden email]>
Sent: Saturday, March 04, 2006 1:09 AM
Subject: RE: project vitality?




Re: Nutch doesn't support Korean?

C. Kang
In reply to this post by T. Kuro Kurosaka
Hello,

There was a similar issue with Lucene's StandardTokenizer.jj.

http://issues.apache.org/jira/browse/LUCENE-444

and

http://issues.apache.org/jira/browse/LUCENE-461

I have almost no experience with Nutch, but you may be able to handle it
the same way as those issues above.
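
As a rough sketch of that idea (widening the set of characters a tokenizer accepts), the Java class below uses a hypothetical `RangeTokenizer` that is not Nutch or Lucene code. It splits text on anything outside a list of inclusive character ranges, so the effect of adding the Hangul Syllables range is easy to see.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeTokenizer {
    // Each row is an inclusive [low, high] range of characters accepted inside a token.
    private final char[][] ranges;

    public RangeTokenizer(char[][] ranges) {
        this.ranges = ranges;
    }

    private boolean isTokenChar(char c) {
        for (char[] r : ranges) {
            if (c >= r[0] && c <= r[1]) {
                return true;
            }
        }
        return false;
    }

    // Splits text into maximal runs of accepted characters; everything else separates tokens.
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        char[][] latinOnly = { {'a', 'z'}, {'A', 'Z'} };
        char[][] withHangul = { {'a', 'z'}, {'A', 'Z'}, {'\uAC00', '\uD7AF'} };
        // A Korean word (U+AC80, U+C0C9) between two English words.
        String text = "Nutch \uAC80\uC0C9 engine";
        System.out.println(new RangeTokenizer(latinOnly).tokenize(text));
        System.out.println(new RangeTokenizer(withHangul).tokenize(text));
    }
}
```

With the Latin-only ranges the Korean word is dropped entirely; with the extra range it comes back as a token of its own. A real grammar change would also have to decide how Korean text should be segmented, which this sketch does not address.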


On 3/4/06, Teruhiko Kurosaka <[hidden email]> wrote:

> I was browsing NutchAnalysis.jj and found that
> Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means
> a Unicode character of the hex value xxxx) are not
> part of the LETTER or CJK class.  It seems to me that
> Nutch cannot handle Korean documents at all.
>
> Is anybody successfully using Nutch for Korean?
>
> -kuro
>


--
Cheolgoo

Re: project vitality?

Doug Cutting
In reply to this post by Richard Braman
Richard Braman wrote:
> I think it is still very much at proof of concept stage.  I think it is
> close, but as you have mentioned, the website is severely out of date
> and the information and documentation on it lacks luster.

It stands to reason that if the documentation lacks "luster" the project
must be dead!  Seriously, this is an active project.  It is not yet 1.0,
so don't expect polish.  If it doesn't look easily usable to you then
perhaps it is not.  It's still for early adopters.

The commit list shows a fair amount of activity:

http://www.mail-archive.com/nutch-commits%40lucene.apache.org/maillist.html

Lots of public sites are using Nutch.  Some are listed at
http://wiki.apache.org/nutch/PublicServers, but many are not, like
http://search.bittorrent.com/.

> I have tried
> to get the tutorial and faqs updated, but I haven't heard back.

This is an all-volunteer project.  If you find a bug, please file a bug
report, so that other folks are aware of it.  Better yet, if you have a
solution or improvement, please construct a patch file (even for
documentation) and attach it to a bug report.  On the wiki, anyone can
make themselves an account and update documentation.  We don't boss
folks around here, or complain.  We pitch in and help.

Doug

Re: project vitality?

sudhendra seshachala
I could not agree with Doug more. This is one of the best. I am trying UIMA too, though UIMA also uses Lucene. As of today, it is still a framework and community in its early stages.

In fact, the nightly builds have good improvements over 0.7.1.
Any serious user or adopter should be trying a snapshot of the nightly build.

Doug,
It would be better if there were an official 0.8 release, or at least an RC,
before the major 1.0 release. I am a newbie, so let me know about ideas on releasing 0.8.

Thanks
Sudhi
 




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


               

RE: project vitality?

Richard Braman

>don't expect polish.
You shouldn't need polish to be able to learn the command required to
resume an aborted crawl, or to index what you have already crawled.
Things like this shouldn't require an Easter egg hunt.  They are going
to happen to everyone doing more than a simple crawl.

>If you find a bug, please file a bug report, so that other folks are
>aware of it.
I have reported two so far.  I have a third one (and a patch) that I am
still in the process of developing and documenting, which relates to parsing
PDFs.

>Better yet, if you have a
>solution or improvement, please construct a patch file (even for
>documentation) and attach it to a bug report. On the wiki, anyone can
>make themselves an account and update documentation. We don't boss
>folks around here, or complain. We pitch in and help.

In the email I sent you I volunteered to help by offering to polish the
documentation myself.  I do need some answers first.  Many of the
questions that get asked on this list unfortunately go unanswered by the
experts.  If they go unanswered, it is impossible for those who would
otherwise share their solutions on the wiki to do so, because there is no
solution to share.

If I went and posted my knowledge about indexing and restarting crawls,
it wouldn't be any better than what is already up there, which is
incomplete and incorrect.  I know there are those of you that know Nutch
inside and out. Right now that's just a few guys.  I know I want to know
more about it, that's why I am spending my free time trying to learn.
Everything I am doing is part of an open source search project, not a
commercial endeavor. I always contribute my knowledge back by posting
answers to things I know about.

Documentation, whether we like it or not, is key to the use of the
product. The onus is on the developers to document the project, and to
provide support when the documentation is clearly lacking.  Once the
developers share more of their knowledge, there will be more
knowledgeable users and the developers won't need to spend as much time on
support and documentation.

I would agree that if you have one URL to crawl, and you crawl it with
a depth of 3 to 6, Nutch is easy to use.  I tried with depth=10, and I hit a
snag.  This has been very hard to get through, given the lack of
documentation.  I have Nutch up and running fine here:
http://24.75.221.234:8080
But this is a simple crawl and doesn't reflect all of the pages needed
to make a good search engine.

I told you I was more than willing to help, and I think many users feel
the same way, but I for one feel that there is a lack of documentation
and support.  This isn't meant to offend anyone, if you are offended you
need to toughen up your skin a little bit.






-----Original Message-----
From: sudhendra seshachala [mailto:[hidden email]]
Sent: Saturday, March 04, 2006 1:26 AM
To: [hidden email]
Subject: Re: project vitality?




RE: project vitality?

carmmello
I really cannot agree with the way Mr. Richard Braman expresses his
views.  I have tried Nutch since version 0.3 and I could not make the
0.8 release work (Nutch is becoming a little bit complicated with all
that MapReduce, Hadoop, and so on, which I can't deal with).  I
understand, however, that if a product is not finished yet, it may
sometimes fall short for lack of some fundamental documentation. But if
a bunch of people develop, for free, a product that is
commercially worth some thousands of dollars and may fit our purposes,
we have to say thanks.  After that we can, of course, express our views,
complaints and suggestions, but we should refrain from harsh,
irrelevant comments that go nowhere, like this non-technical post of
mine.
I myself have my own experimental implementation of Nutch 0.7.1.x (a
nightly version), with more than 400,000 pages, that can sometimes be
viewed during Brazilian working hours at
http://www.qualidade.eng.br/constelacao.htm .  It is in Portuguese, but
English terms related to quality, standards and environment can be
searched.



Re: project vitality?

Stefan Groschupf-2
In reply to this post by Richard Braman
Hi Richard,

> I told you I was more than willing to help, and I think many users feel
> the same way, but I for one feel that there is a lack of documentation
> and support.  This isn't meant to offend anyone, if you are offended you
> need to toughen up your skin a little bit.

Here you can find some more documentation:
http://wiki.media-style.com/display/nutchDocu/Home

It is the first hit when you search for Nutch documentation
with Google.
Sure, it is full of typos and has many language issues
since my English is terrible,
but at least I guess that it already helps some people to get Nutch
0.7 or Nutch 0.8 up and running.

Seriously, Nutch is as production-ready as a noncommercial open
source project could be.
I know people with indexes of more than 500 million pages, and I personally
run crawls at ~300 pages per second.

I'm not sure what more than that you can expect from an open source
search project.

Stefan





RE: project vitality?

Richard Braman
In reply to this post by carmmello
I do thank nutch developers very, very much for what they have put into
the project:)  I think the concept is great and yes it does work, if you
invest the time needed to learn the interfaces, upgrade the
distribution nightly, relearn the commands, etc. Doug's statement that
nutch is for early adopters is accurate.

Now that I have said that, I want to express my feeling that it's hard
when it takes a week to figure out that invertlinks only applies to
version 0.8, and when you ask to become a volunteer, you are met with no
response.  It's also frustrating when you share some hard-earned
insights into something that Nutch needs to work on, like PDF parsing,
and your comments don't get a single good response from the Nutch dev
team.

Sometimes, in OS projects I get the feeling that the developers breathe
different air than users, and that our help is not wanted or that our
questions are stupid and not worth their time to answer.  I don't feel
that there is really any such thing as a stupid question, only stupid
answers.  Some users even ask questions shamefully like: "I know I am a
newbie, and my question is stupid, but here it is anyway".  I think
that's a stigma that we as the larger computer community need to steer
away from, especially if we want newbie users to become advanced users.

Nutch is nowhere near being a dead project, that is not what I said (I
said it was close, not closed), it's just that I don't feel that it's
something that anyone can just download and use without running into
problems.  Problems always exist, but need to be documented correctly so
that they can be solved quickly.  I think nutch has a long way to go
before it is comparable to tomcat or httpd, which are both production
ready and have literally volumes of information on using in every manner
possible.  

I am sorry if you don't like my opinion or the way it is expressed.

-----Original Message-----
From: carmmello [mailto:[hidden email]]
Sent: Saturday, March 04, 2006 10:54 AM
To: [hidden email]
Subject: RE: project vitality?




RE: project vitality?

Howie Wang
In reply to this post by Richard Braman
I agree that the doc could be better, but I still take issue with
the earlier use of the phrase "proof-of-concept". If there are
dozens of sites using it in production, several of them indexing
hundreds of millions of pages, I don't know how you can call it
"proof-of-concept".

Honestly, I'm not sure if there's any other choice for a scalable
open source search engine. Last I checked most of the other
free projects were better suited to small site searches -- nothing
on the scale of tens of millions of pages.

So kudos, Nutch developers!

Howie



Re: project vitality?

chrismattmann
In reply to this post by Richard Braman
Hello,

 I've been following this conversation for the past week and decided that
I'd go ahead and chime in now. I think that honestly this whole thread of
discussion needs to be taken off list, because it doesn't really have
anything to do with the "use" of Nutch: what it boils down to is a list of
complaints, requests for improvements and what not. Nutch's goal is to be a
large-scale, open source search engine: it's not a PDF parsing framework,
nor is it as thoroughly documented as some commercial software -- although
I've run into many commercial software products that don't have the same
quality of documentation that Nutch even has now in its nascent stages.

> Now that I have said that, I want to express my feeling that it's hard
> when it takes a week to figure out that invertlinks only applies to
> version 0.8, and when you ask to become a volunteer, you are met with no
> response.  

You don't need to "ask" to become a volunteer: just do it. As Doug said,
create a patch, submit the patch to JIRA and let the community look at it.
Change something on the Wiki if you don't think that the documentation is
particularly good there. Use Nutch to do whatever you like, and if you feel
that you contributed something that is applicable to a broader community
outside of your domain, let people know about it. If it's really cool, I
wouldn't worry about people ignoring you: they'll come around.

> It's also frustrating when you share some hard-earned
> insights into something that nutch needs to work on, like pdf parsing,
> and your comments don't get a single good response from the nutch dev
> team.  

The nutch "dev team" isn't focused on PDF parsing. Nutch is a search engine
framework, and to Nutch, a PDF parser is a "black box" that conforms to a
standard parsing interface that can be swapped out as technology evolves.
Right now, Nutch uses PDFBox, but in a week it could use "hot super new rad
PDF parsing technology X.1", or some other greater PDF parser. If you feel
that PDFBox isn't getting the job done for your particular domain, then post
an actual question, not pointers to documents for the Nutch developers to go
read. Honestly, I'm guessing they don't have the time, nor the desire to go
read a whole bunch of PDF documentation unless there's a real use case, and
a real need to upgrade the existing parser. Empirically show that Nutch's
PDF capabilities aren't getting the job done, post your results to the list,
and let the community look at them. I'd guess you'd generate more interest and
probably get a better response that way.

>
> Sometimes, in OS projects I get the feeling that the developers breathe
> different air than users, and that our help is not wanted or that our
> questions are stupid and not worth their time to answer.

As far as I can tell the Nutch developers all breathe the same air as us
(and moreover, I believe they put on their pants "one leg at a time")

>
> Nutch is nowhere near being a dead project, that is not what I said (I
> said it was close, not closed), it's just that I don't feel that it's
> something that anyone can just download and use without running into
> problems.  

Problems is a generic word: I would agree with your statement if you
qualified what "problems" means. Small problems like configuration issues?
I'd buy that. Exception messages not providing super super detailed
information about the error? Sure, I'd even buy that in some cases. However,
larger, bigger problems that generally fall in the class of "bugs"? I would
say the answer to that is probably a "no".

> Problems always exist, but need to be documented correctly so
> that they can be solved quickly.  I think nutch has a long way to go
> before it is comparable to tomcat or httpd, which are both production
> ready and have literally volumes of information on using in every manner
> possible.  

Check out the committers list on Tomcat (
http://tomcat.apache.org/whoweare.html) versus that of Nutch (
http://lucene.apache.org/nutch/credits.html). 21 active committers on the
Tomcat PMC and many more emeritus committers. Nutch has fewer than 10. To have
the wealth of capability and functionality that Nutch provides, with the
ability to deploy it in production quality environments (which I can assure
you, after having been on the mailing lists for the better part of a year,
there are plenty), and its ease of use, I would have to respectfully
disagree with the majority of your assertions and say that the Nutch folks
are doing a great job.

Now, can we please take this discussion off the public mailing lists? I
would think that the majority of folks on the list would like to move on. I
know that I would.

Cheers,
  Chris





RE: project vitality?

Richard Braman
>The nutch "dev team" isn't focused on PDF parsing. Nutch is a search
>engine framework,

IMHO, if you don't parse something correctly, you cannot rely on the
results.
We have all parsed things where you leave a comma out and the parse
results are wrong.  If there was a bug in Nutch's HTML parsing, would
that be a big deal? How about if it parsed the text in a particular tag
out of order?  PDF is unfortunately not like HTML, where you can parse the
file sequentially and get an accurate result, but it is the second most
ubiquitous format.  PDFBox is not a PDF parsing framework either.  It has some
PDF parsing algorithms that aren't being used.  Google does a good job
parsing PDF; Nutch has to as well if it's going to compete.




-----Original Message-----
From: Chris Mattmann [mailto:[hidden email]]
Sent: Saturday, March 04, 2006 4:10 PM
To: [hidden email]
Subject: Re: project vitality?


Hello,

 I've been following this conversation for the past week and decided
that I'd go ahead and chime in now. I think that honestly this whole
thread of discussion needs to be taken off list, because it doesn't
really have anything to do with the "use" of Nutch: what it boils down
to is a list of complaints, requests for improvements and what not.
Nutch's goal is to be a large-scale, open source search engine: it's not
a PDF parsing framework, nor is it as thoroughly documented as some
commercial software -- although I've ran into many commercial software
products that don't have the same quality of documentation that Nutch
even has now in its nascent stages.

> Now that I have said that, I want to express my feeling that it's hard

> when it takes a week to figure out that invertlinks only applies to
> version 0.8. and when you ask to become a volunteer, you are met with
> no response.

You don't need to "ask" to become a volunteer: just do it. As Doug said,
create a patch, submit the patch to JIRA and let the community look at
it. Change something on the Wiki if you don't think that the
documentation is particularly well there. Use Nutch to do whatever you
like, and if you feel that you contributed something that is applicable
to a broader community outside of your domain, let people know about it.
If it's really cool, I wouldn't worry about people ignoring you: they'll
come around.

> It's also frustrating when you share some heard earned insights into
> something that nutch needs to work on, like pdf parsing, and your
> comments don't get a single good response from the nutch dev team.

The nutch "dev team" isn't focused on PDF parsing. Nutch is a search
engine framework, and to Nutch, a PDF parser is a "black box" that
conforms to a standard parsing interface that can be swapped out as
technology evolves. Right now, Nutch uses PDFBox, but in a week it could
use "hot super new rad PDF parsing technology X.1", or some other
greater PDF parser. If you feel that PDFBox isn't getting the job done
for your particular domain, then post an actual question, not pointers
to documents for the Nutch developers to go read. Honestly, I'm guessing
they don't have the time, nor the desire to go read a whole bunch of PDF
documentation unless there's a real use case, and a real need to upgrade
the existing parser. Empirically show that Nutch's PDF capabilities
aren't getting the job done, post your results to the list, and let the
community look them. I'd guess you'd generate more interest and probably
get a better response that way.

>
> Sometimes, in OS projects I get the feeling that the developers
> breathe different air than users, and that our help is not wanted or
> that our questions are stupid and not worth their time to answer.

As far as I can tell the Nutch developers all breathe the same air as us
(and moreover, I believe they put on their pants "one leg at a time")

>
> Nutch is nowhere near being a dead project, that is not what I said (I
> said it was close, not closed), it's just that I don't feel that it's
> something that anyone can just download and use without running into
> problems.

Problems is a generic word: I would agree with your statement if you
qualified what "problems" means. Small problems like configuration
issues? I'd buy that. Exception messages not providing super super
detailed information about the error? Sure, I'd even buy that in some
cases. However, larger, bigger problems that generally fall in the class
of "bugs"? I would say the answer to that is probably a "no".

> Problems always exist, but need to be documented correctly so that
> they can be solved quickly.  I think nutch has a long way to go before
> it is comparable to tomcat or httpd, which are both production ready
> and have literally volumes of information on using them in every manner
> possible.

Check out the committers list on Tomcat (
http://tomcat.apache.org/whoweare.html) versus that of Nutch (
http://lucene.apache.org/nutch/credits.html). There are 21 active
committers on the Tomcat PMC and many more emeritus committers. Nutch has
fewer than 10. Given the wealth of capability and functionality that Nutch
provides, the ability to deploy it in production-quality environments (of
which, I can assure you after having been on the mailing lists for the
better part of a year, there are plenty), and its ease of use, I would
have to respectfully disagree with the majority of your assertions and say
that the Nutch folks are doing a great job.

Now, can we please take this discussion off the public mailing lists? I
would think that the majority of folks on the list would like to move
on. I know that I would.

Cheers,
  Chris


>
> I am sorry if you don't like my opinion or the way it is expressed.
>
> -----Original Message-----
> From: carmmello [mailto:[hidden email]]
> Sent: Saturday, March 04, 2006 10:54 AM
> To: [hidden email]
> Subject: RE: project vitality?
>
>
> I really cannot agree with the way Mr. Richard Braman expresses his
> views.  I have tried Nutch since version 0.3 and I could not make the
> 0.8 release work (Nutch is becoming a little bit complicated with all
> that map reduce, hadoop, and so on, that I can't deal with).  I
> understand, however, that if a product is not finished yet, it may
> sometimes fall short for lack of some fundamental documentation,
> but, if there is a bunch of people who develop, for free, a product
> that is commercially worth some thousands of dollars and may fit our
> purposes, we have to say thanks.  After that we can, of course,
> express our views, complaints and suggestions, but we should refrain
> from harsh, irrelevant comments that go nowhere, like this
> non-technical post of mine. I, myself, have my own experimental
> implementation of Nutch 0.7.1.x (a nightly version), with more than
> 400,000 pages, that can sometimes be viewed during Brazilian working
> hours at http://www.qualidade.eng.br/constelacao.htm .  It is in
> Portuguese, but English terms related to quality, standards and
> environment can be searched.
>


Re: project vitality?

Matthias Jaekle
In reply to this post by Richard Braman
 > I am sorry if you don't like my opinion or the way it is expressed.

Hi Richard,

I think most of your opinion is the same as mine. I have been using nutch
since spring 2004 for our page http://www.umkreisfinder.de

It was a big effort to learn how nutch works and also a big effort
to learn how to implement plugins. Seems to be a big system :)

Much of what I know is about version 0.5 or maybe 0.7. It is really
difficult to keep up to date with everything that is going on. In
the last month I did not have the time to read all the messages on the
mailing list, so I also feel less informed about what's going on.
I think the only way to stay informed about what's going on with nutch is
to read the mailing list each day. That's bad - I cannot spend that much
time :(

Sometimes replies on the mailing list are extremely fast, sometimes there
is no response. No response to technical questions, no response when
volunteers ask how they could help, and no response when bugfixes or code
snippets with some improvements are mailed to the mailing list.

I can only agree if you think this is bad. It is bad.
Not only are there people who will never reach a state where
they could help the project - because they did not get the first wattles
- but progress of the nutch project is also slowed down if bugfixes
and questions about how to volunteer are ignored.

I can only suggest posting all patches and improvements to the jira
system, so that this information is never lost.

To me it seems a bit like many people are working on the code
they need. Sometimes two people need the same code - fine - but if
somebody is working on a project or bugfix nobody else in the community
currently needs - very bad. It is also a big question if and when
patches get submitted that are, at the moment, only needed by their
author.

I think we - the whole nutch community - should think about how we
could generate the most value for nutch when people ask how to volunteer.
We should also think about how we could give credit for work done
by volunteers. Maybe simply by checking and adding their improvements to
the official code as soon as possible.

Maybe we should organize ourselves a little better on this point.
What do you think?

It might also be useful to ask all future volunteers to work on some
parts of the wiki to get better documentation. Some of the nutch
specialists would then have to look over the documentation created by
beginners.

May I ask: how many people are currently working on nutch? How much
time do we all together currently spend on nutch?

I am currently working on code to identify geographic information on
websites to improve local searches, but have not found time to implement
my ideas. Much other stuff to do :( I also feel that I should not start
implementing this code until I understand everything that will be
new in the next release. Maybe I will understand all the important new
stuff once I read the release notes of the new version, as soon as they
are available.

Last but not least, THANKS to all volunteers who worked on nutch. I am
glad to be able to use nutch for our services. It is great to have the
code of all the volunteers and run them together with the one percent of
the code I have developed for our website.

Thanks for reading my post

Matthias

--
http://www.eventax.com - eventax GmbH
http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events

RE: project vitality?

Richard Braman
I really do think nutch is great, but I echo Matthias's comments that the
community needs to come together and contribute more back.  And that
comes with the requirement of making sure volunteers are given access to
make their contributions part of the project.

Also, if you use nutch you should be answering other users' questions as
long as you are actively reading the nutch list and you know the answer.
That is almost your obligation for using free open source software.

Putting the faqs and tutorial on the website and not the wiki may be one
of the two biggest problems in getting people started learning nutch.



Re: project vitality?

Stefan Groschupf-2
In reply to this post by Matthias Jaekle
>
> Maybe we should organize us ourself a little bit better in this point.
> What do you think?

Just a general note: jira has a voting feature.
It allows everybody to vote on an issue and shows, in a very
compact form, what the community is looking for.
However, it is not used that often yet. It would be great if more
users used it.

Reading the nutch user list has become very time-consuming, but browsing
issues sorted by votes is very fast.

http://issues.apache.org/jira/browse/NUTCH?report=com.atlassian.jira.plugin.system.project:popularissues-panel

Stefan

Re: project vitality?

chrismattmann
In reply to this post by Richard Braman
Hi Richard,

> IMHO, if you don't parse something correctly, you cannot rely on the
> results.

Good, we're on the same page here.

> We have all parsed things where you leave a comma out and the parse
> results are wrong.  If there was a bug in nutch's html parsing would
> that be a big deal?

Yes, it would be. HTML is the foundation for the web. Its content is the
most pervasive out there (as you allude to below).

> How about if it parsed the text in a particular tag
> out of order?

I'm wondering what that has to do with anything? You may want to read up on
Lucene (http://lucene.apache.org/). Lucene is the underlying text search api
(and index format) that Nutch is built on top of, and I'm wondering if it
cares about the order in which a piece of text is given to it?

> Pdf is unfortunately not html where you can parse the
> file sequentially and get an accurate result,

Gonna have to disagree with you on this. You're making a general statement
that's not true across the board. I would assert that in many cases, you can
still get an accurate result. What about a PDF research paper? Do you care
about what order the text comes in if you're just doing a general
"Google-like" search? When I go to Google and type "grid computing papers",
do I care that "grid computing" comes before some text within the research
paper? Possibly, but mainly I care that "grid computing" was an emphasized
phrase within the text. Now, your definition of "emphasized" may not just be
that it's the first text that appears in the paper, in the title, say: you
may just care that the frequency of "grid computing" in the paper is higher
than a certain threshold relative to other terms. On the other hand, the
fact that "grid computing" is in the title and comes first in the PDF may
mean a lot to you. That's the nature of trying to extract structure out of
inherently unstructured content. I'm not saying that the structure or order
of text within a document is never useful: I agree that in a lot of cases,
it can help you to infer what values are associated with what fields you
want to index, etc. All I'm saying is that it's certainly a subset of the
greater functionality of just doing free text search, so you shouldn't
generalize and say that you can't parse a PDF sequentially and obtain good
results.
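
A toy illustration of the argument above, assuming a plain bag-of-words model rather than any real search engine's scoring: term frequencies come out the same whatever order the text is extracted in, while an exact phrase match does depend on order. The function names and sample strings are made up for the example.

```python
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count lowercase whitespace-separated terms, ignoring their order."""
    return Counter(text.lower().split())


def contains_phrase(text: str, phrase: str) -> bool:
    """Order-sensitive check: are the phrase's words adjacent and in sequence?"""
    words, target = text.lower().split(), phrase.lower().split()
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))


in_order = "grid computing papers survey grid computing"
scrambled = "computing papers grid survey computing grid"

# Same bag of words regardless of extraction order.
same_counts = term_frequencies(in_order) == term_frequencies(scrambled)

# Phrase matching is where extraction order starts to matter.
phrase_in_order = contains_phrase(in_order, "grid computing")
phrase_scrambled = contains_phrase(scrambled, "grid computing")
```

So both sides of the discussion have a point in this simplified picture: frequency-based relevance survives out-of-order extraction, but anything that relies on word adjacency does not.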

> but its use is second most
> ubiquitous.  PDFBox is not a PDF parsing framework either.  It has some
> pdf parsing algorithms that aren't being used.  Google does a good job
> parsing pdf; nutch has to as well if it's going to compete.

Can you show that Google's PDF parsing capability is any better than Nutch's
using accepted evaluation methods for PDF? How about some real use cases and
real results? Until we can see such numbers, I'm hesitant to believe what
you're saying is true. If it is though, then I'm sure that the community
would welcome any updates to the PDF parsing plugin that expedite its
improvement.

Cheers,
  Chris






RE: parsing pdf correctly

Richard Braman
We also agree it's a general statement that most PDF text is not in
sequential order, and while it is definitely not true across the board,
it is true more often than it is untrue.

Maybe not at NASA (where the focus is on scientific research papers that
appeal to space types) but definitely in other government agencies where
publications are intended for a more general public audience.  And
definitely in newsletters and other content where the presentation is
paramount.

PDF is even more layout oriented than html, meaning it often cares less
about the underlying data and focuses solely on presentation.  If the
web was entirely in XML we all know it would be much easier to parse,
but it's not; the content is most often in html or PDF.  Html is easy
to parse compared to PDF.  I have been parsing HTML for the last 10
years, but PDF has basically no underlying structure at all, and the
parsing methods are correspondingly harder.  Even emphasized text in
tags such as H1 doesn't have a PDF equivalent.  The only things you can
truly rely on are the PDF's metadata (and what if the author omitted
that?) or any tagged content, which PDFBox and most other PDF parsers
(multivalent, jpedal) don't currently support either, mainly because so
little pdf content is tagged.  That may change at federal
agencies, though, because of Section 508.
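
As a rough sketch of why storage order and reading order can differ, assume a page is just a list of positioned text fragments. The Fragment type, the coordinates (measured from the top-left corner here, purely for convenience) and the sample data are all invented for illustration; real PDF extraction is considerably messier, and this simple sort only works for single-column layouts.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """A piece of positioned text; the fields are illustrative only."""
    x: float   # distance from the left edge of the page
    y: float   # distance from the top edge of the page
    text: str


def reading_order(fragments: List[Fragment]) -> str:
    """Join fragments top-to-bottom, then left-to-right within a line."""
    ordered = sorted(fragments, key=lambda f: (f.y, f.x))
    return " ".join(f.text for f in ordered)


# Fragments listed in the order they happen to be stored in the file.
stored = [
    Fragment(x=300, y=100, text="Report"),
    Fragment(x=50, y=700, text="Page 1"),
    Fragment(x=50, y=100, text="Annual"),
    Fragment(x=50, y=150, text="Summary of results"),
]

naive = " ".join(f.text for f in stored)   # text in storage order
layout_aware = reading_order(stored)        # text in visual order
```

Reading the fragments in storage order gives scrambled text, while sorting by position recovers the visual order for this simple page.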

In many domain-specific searches pdf may be more ubiquitous than html,
especially in government, which puts almost everything in pdf nowadays.

I think it is pretty obvious that google's pdf parsing technology is
better than nutch's.  Google converts each pdf into an html page and
stores it as such.  I would venture to guess that google runs the
resultant html page through its html parser in order to score the doc,
instead of just stripping text out.  Maybe I am wrong.

I will run the data once my crawl is complete and report on the results,
if data is what you need to be convinced.




