Reviving Nutch 0.7

Reviving Nutch 0.7

Otis Gospodnetic-2
Hi,

I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally.

Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today.  However, I think there is still a need for something much simpler, something like what Nutch 0.7 used to be.  Fairly regular nutch-user inquiries confirm this.  Nutch has too few developers to maintain and further develop both of these concepts, and the main Nutch developers need the more powerful version - 0.8 and beyond.  So, what is going to happen to 0.7?  Maintenance mode?

I feel that there is enough need for 0.7-style Nutch that it might be worth at least considering and discussing the possibility of somehow branching that version into a parallel project that's not just in a maintenance mode, but has its own group of developers (not me, no time :( ) that pushes it forward.

Thoughts?

Otis




Re: Reviving Nutch 0.7

Piotr Kosiorowski
Otis,
Some time ago people on the list said that they were willing to at
least maintain the Nutch 0.7 branch. As a committer (not very active
recently) I volunteered to commit patches when they appear - I do not
have enough time at the moment to do active coding. I have created a
0.7.3 release in JIRA so we can start looking at it. So - we are ready
and willing to move Nutch 0.7 forward but it looks like there is no
interest at the moment.
Regards
Piotr


Re: Reviving Nutch 0.7

Zaheed Haque
In reply to this post by Otis Gospodnetic-2
On 1/22/07, Otis Gospodnetic <[hidden email]> wrote:
> Hi,
>
> I've been meaning to write this message for a while, and Andrzej's StrategicGoals made me compose it, finally.
>
> Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop stabilizes, it will be even more valuable than it is today.  However, I think there is still a need for something much simpler, something like what Nutch 0.7 used to be.  Fairly regular nutch-user inquiries confirm this.  Nutch has too few developers to maintain and further develop both of these concepts, and the main Nutch developers need the more powerful version - 0.8 and beyond.  So, what is going to happen to 0.7?  Maintenance mode?
>
> I feel that there is enough need for 0.7-style Nutch that it might be worth at least considering and discussing the possibility of somehow branching that version into a parallel project that's not just in a maintenance mode, but has its own group of developers (not me, no time :( ) that pushes it forward.
>
> Thoughts?

I agree with you that there is a need for 0.7-style Nutch. I wouldn't
say reviving but more "dissecting and re-directing" :-). Here you go
--- my focus here is 0.7 style, i.e. mid-size, enterprise needs.

Solr could use a good crawler because it has everything else (AFAIK).
Technically this is probably not "plug and pray" :-), and I am not sure
the Solr community wants a crawler, but it could benefit from such an
add-on/snap-on crawler. Furthermore, I am sure some of the 0.7 plugins
could be refactored to fit into Solr.

I will forward the mail to the Solr community to see if there is any interest.

Cheers

RE: Reviving Nutch 0.7

Alan Tanaman
In reply to this post by Otis Gospodnetic-2
Hello,

I'm writing this on behalf of both Armel Nene and myself.

We think that you and those who have responded have a point.  We've been
experiencing quite a number of problems with getting Nutch 0.8 adapted for
our needs, and making changes to support evolving business requirements as
they come up.

So much so, that we've considered replacing the "spine" of Nutch with our
own programs, which would still be compatible with the Nutch plugins (same
parameters etc.), but that would allow us more ease in making changes and
debugging.  We've decided to lay out some of our challenges for you to consider.
 
Our major needs are the ability to deploy on large enterprise file systems
(1-10 Terabytes, large compared to average file systems, but small compared
to the WWW).  We also need to support http, but only specific web sites,
subscription web sites and so on.  We don't need to replicate a
generic-Google implementation.

The main features we are currently working on relate primarily to
near-real-time crawling, specifically:
- Incremental Crawling, where changes are monitored at the folder level,
which is much faster than fetching every URL and checking for a change.
Note that this is similar to adaptive crawling, but will be even more
efficient.
- Special handling for parsing of large files (possibly farming those out to
dedicated processors a-la Amazon).  Hadoop would be useful here, but we
would consider re-adding this at a later stage.
- Incremental Indexing, where documents are added to or removed from a live
index, instead of rebuilding a new index each time.
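
To illustrate the folder-level monitoring idea from the first bullet, here is a minimal Python sketch (it is not code from Nutch or from the authors of this message). It assumes that a folder's modification time changes whenever an entry in it is added, removed, or renamed, which holds on common file systems but does not catch a file that is edited in place. The function names snapshot_folders and changed_folders are made up for this example.

```python
import os


def snapshot_folders(root):
    """Map every folder under root to its modification time in nanoseconds."""
    return {
        dirpath: os.stat(dirpath).st_mtime_ns
        for dirpath, _dirs, _files in os.walk(root)
    }


def changed_folders(old, new):
    """Return folders that are new, removed, or whose mtime differs."""
    added_or_touched = {d for d, mtime in new.items() if old.get(d) != mtime}
    removed = set(old) - set(new)
    return added_or_touched | removed
```

Only the folders returned by changed_folders would need their files listed and re-fetched, so the cost of a pass scales with the number of folders rather than the number of documents.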

We would be happy to join a group of 0.7 developers, if that would enable us
to pursue this enterprise-based direction, which clearly has different
challenges than those facing WWW-crawling.

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
http://blog.idna-solutions.com

-----Original Message-----
From: Otis Gospodnetic [mailto:[hidden email]]
Sent: 22 January 2007 06:48
To: Nutch Developer List
Subject: Reviving Nutch 0.7


Re: Reviving Nutch 0.7

Sami Siren-2
In reply to this post by Otis Gospodnetic-2
2007/1/22, Otis Gospodnetic <[hidden email]>:

>
> Hi,
>
> I've been meaning to write this message for a while, and Andrzej's
> StrategicGoals made me compose it, finally.
>
> Nutch 0.8 and beyond is very cool, very powerful, and once Hadoop
> stabilizes, it will be even more valuable than it is today.  However, I
> think there is still a need for something much simpler, something like what
> Nutch 0.7 used to be.  Fairly regular nutch-user inquiries confirm
> this.  Nutch has too few developers to maintain and further develop both of
> these concepts, and the main Nutch developers need the more powerful version
> - 0.8 and beyond.  So, what is going to happen to 0.7?  Maintenance mode?
>
> I feel that there is enough need for 0.7-style Nutch that it might be
> worth at least considering and discussing the possibility of somehow
> branching that version into a parallel project that's not just in a
> maintenance mode, but has its own group of developers (not me, no time :( )
> that pushes it forward.
>
> Thoughts?
>
>
Before doubling (or, after 0.9.0, tripling?) the maintenance/development work,
please consider the following:

One option would be refactoring the code so that the parts that are
usable by other projects, like protocols (?), parsers (this was actually
proposed by Jukka Zitting some time last year) and so on, are modified to be
independent of Nutch (and Hadoop) code. Yeah, this is easy to say, but it
would require a significant amount of work.

The "more focused", smaller chunks of Nutch would probably also get a bigger
audience (perhaps also outside Nutch land) and that way perhaps more people
willing to work on them.

I don't know about others, but at least I would be more willing to work towards
this goal than towards one where there would be practically many separate
projects, each sharing common functionality but with a different code base.

--
 Sami Siren

Re: Reviving Nutch 0.7

chrismattmann
 

> Before doubling (or after 0.9.0 tripling?) the maintenance/development  work
> please consider the following:
>
> One option would be re factoring the code in a way that the parts that are
> usable to other projects like protocols?, parsers (this actually was
> proposed by
> Jukka Zitting some time last year) and stuff would be modified to be
> independent
> of nutch (and hadoop) code. Yeah, this is easy to say, but would require
> significant amount of work.
>
> The "more focused",smaller chunks of nutch would probably also get bigger
> audience (perhaps also outside nutch land) and that way perhaps more people
> willing to work for them.
>
> Don't know about others but at least I would be more willing to work towards
> this goal than the one where there would be practically many separate
> projects,
> each sharing common functionality but different code base.

+1 ;)

This was actually the project proposed by Jerome Charron and myself, called
"Tika". We went so far as to create a project proposal, and send it out to
the nutch-dev list, as well as the Lucene PMC for potential Lucene
sub-project goodness. I could probably dig up the proposal should the need
arise.

Good ol' Jukka then took that effort and created us a project within Google
code, that still lives in there in fact:

http://code.google.com/p/tika/

There hasn't been active development on it because:

1. None of us (I'm speaking for Jerome, and myself here) ended up having the
time to shepherd it going forward

2. There was little, if any response, from the proposal to the nutch-dev
list, and folks willing to contribute (besides people like Jukka)

3. I think, as you correctly note above, most people thought it to be too
much of a Herculean effort that wouldn't pay the necessary dividends in the
end to undertake it


In any case, I think that, if we are going to maintain separate branches of
the source, in fact, really parallel projects, then an undertaking such as
Tika is properly needed ...

Cheers,
   Chris







Re: Reviving Nutch 0.7

Sami Siren-2
Chris Mattmann wrote:
> In any case, I think that, if we are going to maintain separate branches of
> the source, in fact, really parallel projects, then an undertaking such as
> Tika is properly needed ...

I still don't think we need a separate project to start with; IMO the right
frame of mind is enough to get going. If people think this is the right
direction and it goes beyond talk, then perhaps after that we could start
talking about a separate project.


--
 Sami Siren



Re: Reviving Nutch 0.7

Otis Gospodnetic-2-2
In reply to this post by Otis Gospodnetic-2
Yes, certainly, anything that can be shared and decoupled from pieces that make each branch (not SVN/CVS branch) different, should be decoupled.  But I was really curious about whether people think this is a valid idea/direction, not necessarily immediately how things should be implemented.

In my mind, one branch is the branch that runs on top of Hadoop, with NameNode, DataNode, HDFS, etc.  That's the branch that's in the trunk.  The other branch is a simpler branch without all that Hadoop stuff, for folks who need to fetch, index, and search a few hundred thousand or a few million or even a few tens of millions of pages, and don't need replication, etc. that comes with Hadoop.  That branch could be based off of 0.7.

I also know that a lot of people are trying to use Nutch to build vertical search engines, so there is also a need for a focused fetcher.  Kelvin Tan brought this up a few times, too, I believe.

I *think* there is a need for that.
I *can't* help shepherd this, but wanted to bring this up, in case there are people lurking who want to work on this.

Otis

----- Original Message ----
From: Sami Siren <[hidden email]>
To: [hidden email]
Sent: Monday, January 22, 2007 10:52:47 AM
Subject: Re: Reviving Nutch 0.7


Re: Reviving Nutch 0.7

Doug Cutting
[hidden email] wrote:
> Yes, certainly, anything that can be shared and decoupled from pieces that make each branch (not SVN/CVS branch) different, should be decoupled.  But I was really curious about whether people think this is a valid idea/direction, not necessarily immediately how things should be implemented.  In my mind, one branch is the branch that runs on top of Hadoop, with NameNode, DataNode, HDFS, etc.  That's the branch that's in the trunk.  The other branch is a simpler branch without all that Hadoop stuff, for folks who need to fetch, index, and search a few hundred thousand or a few million or even a few tens of millions of pages, and don't need replication, etc. that comes with Hadoop.  That branch could be based off of 0.7.  I also know that a lot of people are trying to use Nutch to build vertical search engines, so there is also a need for a focused fetcher.  Kelvin Tan brought this up a few times, too, I believe.

Branching doesn't sound like the right solution here.

First, one doesn't need to run any Hadoop daemons to use Nutch:
everything should run fine in a single process by default.  If there are
bugs in this they should be logged, folks who care should submit
high-quality, back-compatible, generally useful patches, and committers
should work to get these patches committed to the trunk.

Second, if there are to be two modes of operation, wouldn't they best be
developed in a common source tree, so that they share as much as
possible and diverge as little as possible?  It seems to me that a good
architecture would be to agree on a common high-level API, then use two
different runtimes underneath, one to support distributed operation, and
one to support standalone operation.  Hey!  That's what Hadoop already
does!  Maybe it's not perfect and someone can propose a better way to
share maximal amounts of code, but the code split should probably be
into different classes and packages in a single source tree maintained
by a single community of developers, not by branching a single source
tree in a revision control and splitting the developers.
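
The "common high-level API, two runtimes" idea in the paragraph above can be sketched roughly as follows. This is a toy Python illustration, not Hadoop's or Nutch's actual API: LocalRuntime, PooledRuntime and count_inlinks are hypothetical names, and the thread pool only stands in for a distributed runtime. The point is that the job code is written once and does not know which runtime executes it.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def _group(pairs):
    """Group (key, value) pairs by key, as a shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


class LocalRuntime:
    """Runs the whole job sequentially in the calling process."""

    def run(self, records, mapper, reducer):
        pairs = [kv for rec in records for kv in mapper(rec)]
        return {k: reducer(k, vs) for k, vs in _group(pairs).items()}


class PooledRuntime:
    """Stand-in for a distributed runtime: map tasks run on a worker pool."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, records, mapper, reducer):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            mapped = pool.map(lambda rec: list(mapper(rec)), records)
            pairs = [kv for chunk in mapped for kv in chunk]
        return {k: reducer(k, vs) for k, vs in _group(pairs).items()}


def count_inlinks(runtime, pages):
    """Job written once against the shared API: count links into each URL."""

    def mapper(page):
        for target in page["outlinks"]:
            yield target, 1

    def reducer(_url, ones):
        return sum(ones)

    return runtime.run(pages, mapper, reducer)
```

Calling count_inlinks(LocalRuntime(), pages) and count_inlinks(PooledRuntime(), pages) should give the same result; only the execution strategy differs, which is the property that lets both modes live in one source tree.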

Third, part of the problem seems like there are too few
contributors--that the challenges are big and the resources limited.
Splitting the project will only spread those resources more thinly.

What really is the issue here?  Are good patches languishing?  Are there
patches that should be committed (meet coding standards, are
back-compatible, generally useful, etc.) but are not?  A great patch is
one that a committer can commit with few worries: it includes new
unit tests, it passes all existing unit tests, it fixes one thing only,
etc.  Such patches should not have to wait long for commit.  And once
someone submits a few such patches, they should be invited to become
a committer.

It sounds to me like the problem is that, off-the-shelf, Nutch does not
yet solve all the problems folks would like it to: e.g., it has never
done a good job with incremental indexing.  Folks see progress made on
scalability, but really wish it were making more progress on
incrementality or something else.  But it's not going to make progress
on incrementality without someone doing the work.  A fork or a branch
isn't going to do the work.  I don't see any reason that the work cannot
be done right now.  It can be done incrementally: e.g., if the web db
API seems inappropriate for incremental updates, then someone should
submit a patch that provides an incremental web db API, updating the
fetcher and indexer to use this.  A design for this on the wiki would be
a good place to start.
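
As one possible starting point for such a design, here is a minimal Python sketch of what an incremental web db API could look like. Everything in it is an assumption made for illustration (the class names IncrementalWebDb and LiveIndex, the change log, the whitespace tokenizer); it does not describe Nutch's web db or Lucene. The db records each change in an append-only log, and the index applies only the changes it has not yet seen instead of being rebuilt.

```python
class IncrementalWebDb:
    """Hypothetical page store that logs every change it receives."""

    def __init__(self):
        self.pages = {}  # url -> page text
        self.log = []    # (op, url); list position is the sequence number

    def put(self, url, text):
        self.pages[url] = text
        self.log.append(("put", url))

    def delete(self, url):
        self.pages.pop(url, None)
        self.log.append(("delete", url))

    def changes_since(self, seq):
        """Return the changes after seq and the new sequence number."""
        return self.log[seq:], len(self.log)


class LiveIndex:
    """Toy inverted index that is updated in place."""

    def __init__(self):
        self.postings = {}  # term -> set of urls
        self.terms = {}     # url -> terms currently indexed for it
        self.seq = 0

    def _remove(self, url):
        for term in self.terms.pop(url, set()):
            self.postings[term].discard(url)
            if not self.postings[term]:
                del self.postings[term]

    def sync(self, db):
        """Apply only the db changes made since the last sync."""
        changes, self.seq = db.changes_since(self.seq)
        for op, url in changes:
            self._remove(url)
            if op == "put" and url in db.pages:
                terms = set(db.pages[url].lower().split())
                self.terms[url] = terms
                for term in terms:
                    self.postings.setdefault(term, set()).add(url)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))
```

A fetcher would call put and delete as pages change, and the indexer would call sync periodically; the cost of a sync depends on the number of changes, not on the size of the index.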

Finally, web crawling, indexing and searching are data-intensive.
Before long, users will want to index tens or hundreds of millions of
pages.  Distributed operation is soon required at this scale, and
batch-mode is an order-of-magnitude faster.  So be careful before you
throw those features out: you might want them back soon.

Doug



Re: Reviving Nutch 0.7

AJ Chen-2
On 1/22/07, Doug Cutting <[hidden email]> wrote:

>
>
> Finally, web crawling, indexing and searching are data-intensive.
> Before long, users will want to index tens or hundreds of millions of
> pages.  Distributed operation is soon required at this scale, and
> batch-mode is an order-of-magnitude faster.  So be careful before you
> throw those features out: you might want them back soon.
>
> Doug

As a developer building an application on top of Nutch, my experience is that
I can't go back to version 0.7.x because the features in version 0.8/0.9 are
so much needed even for non-distributed crawling/indexing. For example, I
can run crawling/indexing on a Linux server and a Windows laptop separately,
and merge the newly crawled databases into the main crawldb. I remember that
v0.7 can't merge separate crawldbs without lots of customization.
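
To make the merge step concrete, here is a small Python sketch of one plausible merge rule. It is an assumption for illustration, not a description of how Nutch's own merge tool works: each crawl database is modelled as a dict from URL to a record with a hypothetical fetch_time field, and when the same URL appears in several databases the most recently fetched record wins.

```python
def merge_crawldbs(main, *others):
    """Merge crawl databases, keeping the newest record for each URL."""
    merged = dict(main)  # copy so the main db is left untouched
    for db in others:
        for url, record in db.items():
            current = merged.get(url)
            if current is None or record["fetch_time"] > current["fetch_time"]:
                merged[url] = record
    return merged
```

With this rule, databases crawled on separate machines can be folded into the main one in any order and still produce the same result, as long as fetch times are comparable across machines.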

It may take some time to switch from 0.7.x to 0.8/0.9, especially if you
have lots of customization code. But once you get over this one hurdle, you
will enjoy the new and better features in 0.8/0.9.  Also, this may
be the time to rethink the design of your application. For my own project,
I always try to separate my code from the Nutch core code as much as possible so
that I can easily upgrade the application to keep up with new Nutch releases.
Staying away from the newest Nutch version seems somewhat backward to me.

AJ
--
AJ Chen, PhD
Palo Alto, CA
http://web2express.org

Re: Reviving Nutch 0.7

Otis Gospodnetic-2-2
In reply to this post by Otis Gospodnetic-2
All good arguments, and as nobody else voiced the desire to have this other branch of Nutch I was rambling about, I'll consider this thread done.
Thanks for the explanations, Doug.

Otis

----- Original Message ----
From: Doug Cutting <[hidden email]>
To: [hidden email]
Sent: Monday, January 22, 2007 1:40:30 PM
Subject: Re: Reviving Nutch 0.7


RE: Reviving Nutch 0.7

Alan Tanaman
Doug Cutting wrote:
> Branching doesn't sound like the right solution here. ...

I couldn't agree more that the not-splitting-up approach is indeed better
for resource-utilization, but how do we get round the problems that we keep
encountering?

We haven't managed to run a script without Hadoop popping up to do the
map/reduce.  And many of the problems we have encountered in debugging with
the Hadoop interaction are probably down to a lack of understanding on our
part of how Hadoop works.

It is also clear to me that most Nutch developers and users are quite happy
with the direction it is taking -- it is after all intended to be used,
first and foremost as a web-crawler a-la Google, as opposed to the way we
use it for enterprise file-systems, databases, and very specific small-scope
web crawling.

[hidden email] wrote:
> All good arguments, and as nobody else voiced the desire to have this
> other branch of Nutch I was rambling about, I'll consider this thread
> done.

Unfortunately, at the current juncture, we have spent so much time trying to
work around the problems that we encountered with 0.8.1 that we have had to
roll back to 0.7.2, albeit temporarily.

We would however, be most happy to rejoin the team effort at 0.9 once our
pressing issues have been resolved, and if we can somehow find a way to add
classes/methods to the architecture to cater more successfully for
enterprise search.

Best regards,
Alan
_________________________
Alan Tanaman
iDNA Solutions
Tel: +44 (20) 7257 6125
Mobile: +44 (7796) 932 362
http://blog.idna-solutions.com




Re: Reviving Nutch 0.7

Nutch Newbie
In reply to this post by Doug Cutting
Doug:

I agree with all of your comments except the following:

> Third, part of the problem seems like there are too few
> contributors--that the challenges are big and the resources limited.
> Splitting the project will only spread those resources more thinly.

IMHO, there is a lot of duplicated effort (i.e. on and off the FOSS domain).
Crawling, file parsing, analyzers, incremental indexing, etc. are common
discussion topics on every Lucene mailing list. This spreads resources
across many duplicated efforts instead of focusing them on a common, agreed
high-level API.

Instead of branching or creating a new project, it is more efficient to develop
libs (i.e. Nutch crawler, analyzer, etc.) so that other projects (on or off the
FOSS domain) can re-use them, i.e. code base sharing should be easy, not difficult.
This is exactly the reason NDFS became Hadoop. Now anyone can read the
Hadoop API and combine it with Lucene without much trouble to run the
Lucene index engine on top of Hadoop.

A crawler or analyzer can be re-used in the same manner as above. The same goes
for indexing or searching, as you pointed out previously:

http://www.gossamer-threads.com/lists/lucene/general/41211

Again, I am not really proposing a new project but more easy-to-use,
re-usable code. IMHO, Nutch will be an umbrella project for
"ala-Google" and Solr will be for "ala-Enterprise", where Lucene
is the index lib and Hadoop is the Mapred/DFS lib. What is missing is
a common crawler lib, a common indexing lib, etc.

Regards

Re: Reviving Nutch 0.7

J. Delgado
Nutch Newbie wrote:
> Again not really proposing a new project but more easy to use
> re-usable code. IMHO, Nutch will be an umbrella project for
> "ala-Google" and Solr will be for "ala-Enterprise"  where Lucene
> is the index lib, Hadoop is the Mapred/DFS lib ..what is missing is
> Common Crawler lib, Common indexing lib etc..

EXACTLY!

-- Joaquin