Next Generation Nutch


Next Generation Nutch

Dennis Kubes-2
I have been thinking about a next generation Nutch for a while now, have
had some talks with some of the other committers, and have gotten around
to putting some thoughts / requirements down on paper.  I wanted to run
these by the community and get feedback.  This message will be a bit
long, so please bear with me.

First let me define that I think the purpose of Nutch is to be a web
search engine.  When I say that I mean to specifically exclude
enterprise search.  By web search I am talking about general or vertical
search engines in the 1M-20B document range.  I am excluding things such
as database-centric search and possibly even local filesystem search.
IMO Solr is a very capable enterprise search product and could handle
local filesystem search (if it doesn't already), and Nutch shouldn't try
to overlap that functionality.  I think the two should be able to
interact, maybe even share indexes, but not overlap in purpose.  I think
that Nutch should be designed to handle large datasets, meaning it has
the ability to scale to billions, perhaps tens of billions, of pages.
Hadoop already gives us this capability for processing, but Nutch would
need to improve on the search server and shard management side of things
to be able to scale to the billion-page level.  So the next generation
of Nutch, I think, should focus on web-scale search.

After working with Hadoop and MapReduce for the last couple of years, I
find it interesting just how similar development of MapReduce programs
seems to be to the Linux/Unix philosophy of small programs chained
together to accomplish big things.  Going forward I see this as a
healthy overall general architecture.  Nutch would have many small tools
that would be linked through data structures.  We already do this to
some extent in the current version of Nutch, an example of which would
be the tools that generate and act on CrawlDatum objects (CrawlDb,
UpdateDb, etc.).  I would like to keep that idea of tools and data
structures, with the tools chained together, perhaps only by shell or
management scripts, in different pipelines acting on the data
structures.  When I say data structure I don't mean binary map or
sequence files.  These may be a standard way to store these objects, but
Hadoop allows any input / output format, whether that be HBase, a
relational database, or a local filesystem.  I think we should be open
to having those data structures stored however is best for the user,
through different Hadoop formats.  So: a general overall architecture of
tools and data structures, and pipelines of these tools.
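
As a rough sketch of what I mean by storing the data structures through
different Hadoop formats (the class names below are made up and the
wiring is illustrative only, using the plain org.apache.hadoop.mapred
API), a tool might look something like this; only the two set*Format
calls would change to target HBase, a database, or the local filesystem:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;
  import org.apache.hadoop.mapred.lib.IdentityMapper;
  import org.apache.hadoop.mapred.lib.IdentityReducer;

  public class UpdateDbTool {
    public static void main(String[] args) throws Exception {
      JobConf job = new JobConf(UpdateDbTool.class);
      job.setJobName("updatedb");

      // Storage is a plain Hadoop abstraction; swapping these classes is
      // all it takes to read from or write to a different backend.
      job.setInputFormat(SequenceFileInputFormat.class);
      job.setOutputFormat(MapFileOutputFormat.class);

      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      // Identity map/reduce stand in for the real CrawlDatum logic here.
      job.setMapperClass(IdentityMapper.class);
      job.setReducerClass(IdentityReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);

      JobClient.runJob(job);
    }
  }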

I currently see five or six distinct phases to a web search engine.
They are: Acquire, Parse, Analyze, Index, Search, and Shard Management.
OK, shard management might not be so much a phase as a piece of
functionality.  Acquire is simply the acquisition of the document, be it
PDF, HTML, or images.  This would usually be the crawler phase.  Parse
is parsing that content into useful and standard data structures.  I do
believe that parsing should be separate and distinct from crawling.  If
you crawl 50% of 5M pages and the crawler dies, you should still be able
to use the 50% you crawled.  Analyze is what we do with the content once
it is parsed into a standard structure we can use.  This could be
anything from better link analysis to natural language processing,
language identification, and machine learning.  The analysis phase
should probably have an ever-expanding set of tools for different
purposes.  These tools would create specialized data structures of their
own.  Eventually, through all the analysis, we come up with a score for
a given piece of content.  That could be a document or a field.
Indexing is the process of taking the analysis scores and content and
creating the indexes for searching.  Searching is concerned with
querying those indexes.  This should be doable from the command line,
web-based interfaces, or other ways.  Shard management is concerned with
the deployment and management of a large number of indexes.

I think the next generation of Nutch should allow swapping out different
tools in any of these areas.  What this means is the ability to have
different components such as web crawlers (as long as the end data
structure is the same), for example Fetcher, Fetcher2, Grub, Heritrix,
or even specialized crawlers, and different components for different
analysis types.  I don't see a lot of cross-cutting concerns here, and
where there are, URL normalization for example, I think they can be
handled better through dependency injection.

Which brings me to my third point.  I think it is time to get rid of the
plugin framework.  I want to keep the functionality of the various
plugins, but I think using a dependency injection framework, such as
Spring, to create the components needed for the logic inside of tools is
a much cleaner way to proceed.  This would allow much better unit and
mock testing of tool and logic functionality.  It would allow Nutch to
run on a non-"nutchified" Hadoop cluster, meaning just a plain old
Hadoop cluster.  We could have core jars and contrib jars and a contrib
directory which is pulled from by shell scripts when submitting jobs to
Hadoop.  With the multiple-resources functionality in Hadoop it would be
a simple matter of creating the correct command lines for the job to
run.
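
To make that concrete, here is a minimal sketch of the idea (the bean
name, the XML file, and the UrlNormalizer interface are all hypothetical,
and it assumes Spring is on the classpath).  The point is that the tool
asks the container for its collaborators instead of going through a
plugin registry, which also makes it trivial to inject mocks in tests:

  import org.springframework.context.ApplicationContext;
  import org.springframework.context.support.ClassPathXmlApplicationContext;

  public class InjectedTool {

    /** Hypothetical component interface; a plugin becomes a plain bean. */
    public interface UrlNormalizer {
      String normalize(String url);
    }

    public static void main(String[] args) {
      // The concrete implementation is chosen in configuration, not code.
      ApplicationContext ctx =
          new ClassPathXmlApplicationContext("nutch-components.xml");
      UrlNormalizer normalizer = (UrlNormalizer) ctx.getBean("urlNormalizer");
      System.out.println(normalizer.normalize("HTTP://Example.COM/./a.html"));
    }
  }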

And that brings me to separation of data and presentation.  Currently
the Nutch web application is one monolithic JSP application with
plugins.  I think the next generation should segment that out into XML /
JSON feeds and a separate front end that uses those feeds.  Again, this
would make it much easier to create web applications using Nutch.
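
As an illustration only (the servlet below and its hand-built JSON are
just a stub, not a proposed API), the data side could be as small as
something like this, with any front end consuming the feed however it
likes:

  import java.io.IOException;
  import javax.servlet.http.HttpServlet;
  import javax.servlet.http.HttpServletRequest;
  import javax.servlet.http.HttpServletResponse;

  public class SearchFeedServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws IOException {
      String query = req.getParameter("q");
      resp.setContentType("application/json");
      // A real implementation would ask the search backend (distributed
      // Nutch searchers, or Solr) for hits; this just emits an empty set.
      resp.getWriter().write("{\"query\":\"" + query + "\",\"hits\":[]}");
    }
  }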

And of course I think that shard management, a la Hadoop master and
slave style, is a big requirement as well.  I also think a full test
suite with mock objects and local, MiniMR, and MiniDFS cluster testing
is important, as is better documentation and tutorials (maybe even a
book :)).  Up to this point I have created MapReduce jobs that use
Spring for dependency injection, and it is simple and works well.  The
above is the direction I would like to head down, but I would also like
to see what everyone else is thinking.
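
For what it's worth, a cluster-level test could be as small as the
sketch below (assuming Hadoop's test classes are on the classpath; the
MiniDFSCluster package has moved between Hadoop versions, so treat this
as illustrative rather than copy-paste ready):

  import junit.framework.TestCase;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.dfs.MiniDFSCluster;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class TestCrawlDbOnMiniCluster extends TestCase {

    public void testRoundTripThroughDfs() throws Exception {
      Configuration conf = new Configuration();
      MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
      try {
        FileSystem fs = cluster.getFileSystem();
        Path crawlDb = new Path("/test/crawldb");
        fs.mkdirs(crawlDb);
        // A real test would run a CrawlDb job here and assert on output.
        assertTrue(fs.exists(crawlDb));
      } finally {
        cluster.shutdown();
      }
    }
  }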

Dennis









Re: Next Generation Nutch

John Mendenhall
> Which brings me to three.  I think it is time to get rid of the plugin
> framework.  I want to keep the functionality of the various plugins but
> I think a dependency injection framework, such as spring, creating the
> components needed for logic inside of tools is a much cleaner way to
> proceed.  This would allow much better unit and mock testing of tool and
> logic functionality.  It would allow Nutch to run on a non "nutchified"
> Hadoop cluster, meaning just a plain old hadoop cluster.  We could have
> core jars and contrib jars and a contrib directory which is pulled from
> by shell scripts when submitting jobs to Hadoop.  With the
> multiple-resources functionality in Hadoop it would be a simple matter
> of creating the correct command lines for the job to run.
>
> And that brings me to separation of data and presentation.  Currently
> the Nutch website is one monolithic jsp application with plugins.  I
> think the next generation should segment that out into xml / json feeds
> and a separate front end that uses those feeds.  Again this would make
> it much easier to create web applications using nutch.

I have not been using Nutch as long as most everyone else
here on the list (just since the middle of last year).  I have
written a handful of plugins.  The current system seems to work well.

However, I am a strong proponent of the Unix approach to systems.
I strongly believe a system can be more flexible, more usable by
more users, and more customizable when each subsystem can be run
independently of the others.

As long as the "feeds" are atomic enough in nature to be able to
insert other modifying or filtering tools between the main Nutch
tools, I find this to be a better overall solution in the long
run.

Assuming this is the way Nutch moves forward, do we allow Nutch
to stay as-is, with plugins and all, and create a new project?
Or, do we not worry about abandoning the current setup and
changing it en masse?

JohnM

--
john mendenhall
[hidden email]
surf utopia
internet services

Re: Next Generation Nutch

chrismattmann
In reply to this post by Dennis Kubes-2
Hi Dennis,

Thanks for putting this together. I think that it's also important to add to
this list the ability to cleanly separate out the following major
components:

1. The underlying distributed computing infrastructure (e.g., why does it
have to be Hadoop? Shouldn't Nutch be able to run on top of JMS, XML-RPC,
or even grid computing technologies and web services? Hadoop can
certainly be _the_ core implementation of the underlying substrate, but
the ability to change this out should be a lot easier than it currently
is. Read on below to see what I mean.)

2. The crawler. Right now I think it's much too tied to the underlying
orchestration process and infrastructure.

3. The data structures. You do mention this below, but I would add to it
that the data structures for Nutch should be simple POJOs and not have
any tie to the underlying infrastructure (e.g., no need for Writable
methods, etc.)
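
(Just to illustrate what I mean, and purely as a made-up sketch: the
core record stays a plain object, and the Writable glue lives only in
the Hadoop-specific layer.)

  // Core data model: no Hadoop imports at all.
  public class CrawlRecord {
    private String url;
    private long fetchTime;
    private byte status;
    // getters and setters omitted for brevity
  }

  // Hadoop-specific adapter, kept out of the core data model.
  class CrawlRecordWritable implements org.apache.hadoop.io.Writable {
    private final CrawlRecord record = new CrawlRecord();
    public void write(java.io.DataOutput out) throws java.io.IOException {
      // serialize the wrapped record
    }
    public void readFields(java.io.DataInput in) throws java.io.IOException {
      // deserialize into the wrapped record
    }
  }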

I think that with these types of guiding principles, along with what you
mention below, there is the potential here to create a really flexible,
reusable architecture, so that when folks come along and ask,
"I've written crawler XXX, how do I integrate it into Nutch?", we don't
have to come back and say that the entire system has to be changed, or
even worse, that it cannot be done at all.

My 2 cents,
 Chris
 


______________________________________________
Chris Mattmann, Ph.D.
[hidden email]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Next Generation Nutch

Otis Gospodnetic-2
In reply to this post by Dennis Kubes-2
Hello,

A few quick comments.  I don't know how much you track Solr, but the mention of shards makes me think of SOLR-303 and the DistributedSearch page on the Solr Wiki.  You'll want to check those out.  In short, Solr has the notion of shards and distributed search, kind of like Nutch with its RPC framework and searchers.  *That* is one big duplication of work, IMHO.  As far as indexing + searching + shards go, I think one direction worth looking at carefully would be the gentle Nutch->Solr relationship -- using Solr to do indexing and searching.  Shard management doesn't exist in either project yet, but I think it would be ideal to come up with a common management mechanism, if possible.

I think this addresses your "... but Nutch would need to improve
on the search server and shard management side  of things to be able to
scale to the billion page level.  So the next generation of Nutch I
think should focus on web scale search." statement.

I know of a well-known, large corporation evaluating Solr (and its dist. search in particular) to handle 1-2B docs and 100 QPS.

I don't fully follow the part about getting rid of plugins, spring, etc., so I can't comment.

Regarding the webapp - perhaps Solr and SolrJ could be used here.  Solr itself is a webapp, and it contains various ResponseWriters that can output XML, JSON, pure Ruby, Python, even binary responses (in JIRA).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



Re: Next Generation Nutch

Otis Gospodnetic-2
In reply to this post by Dennis Kubes-2
Hi,

Hm, I have to say I'm not sure if I agree 100% with part 1.  I think it would be great to have such flexibility, but I wonder if trying to achieve it would be over-engineering.  Do people really need that?  I don't know, maybe!  If they do, then ignore my comment. :)

I'm curious about 2. - could you please explain a little what you mean
by "too tied to the underlying orchestration process and infrastructure"?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: Next Generation Nutch

chrismattmann
Hi Otis,

Thanks for your comments. My responses inline below:

>
> Hm, I have to say I'm not sure if I agree 100% with part 1.  I think it would
> be great to have such flexibility, but I wonder if trying to achieve it would
> be over-engineering.  Do people really need that?  I don't know, maybe!  If
> they do, then ignore my comment. :)

Well, in the past, at least in my experience, this is exactly what has
paid off for us: having the flexibility to architect a system that isn't
tied to the underlying technology. We once had a situation at JPL where
a software product was using CORBA for its underlying middleware
implementation framework. This (previously free) CORBA solution turned
into a 30K/year licensed solution, at the direction of the vendor, on a
one-week timeframe. Because we had architected and engineered our
software system to be independent of the underlying middleware
substrate, we were able to switch over to a free, Java RMI-based
solution in a matter of a weekend.

Of course, this is typically bound to certain classes of underlying
substrates, and middleware solutions (e.g., it would be difficult to switch
out certain middlewares with vastly different architectural styles, say, if
we were trying to switch from CORBA to a P2P based solution like JXTA), but
all I'm saying is that it would be great if we didn't have to dictate to a
potential Nutch 2.0 user that to use our scalable, open source search engine
solution, you have to change from a JMS house to a Hadoop house. It would be
nice to say that we've architected Nutch 2.0 to be independent of the
underlying middleware provider. Of course, we can provide a default
implementation based on the existing Hadoop substrate, but we should provide
interfaces, data components, and architectural guidelines to be able to
change to say, a Nutch solution over XML-RPC, or Web-Services, or JMS,
without breaking the core architecture. Right now, I'm convinced that can't
be done, or in other words, it's too hard to tease the Hadoop notions out of
Nutch as it exists today.

>
> I'm curious about 2. - could you please explain a little what you mean by "too
> tied to the underlying
> orchestration process and infrastructure."?

What I mean by this is that Fetcher/Fetcher2 dictates the orchestration
process for crawling: there is no separate, independent Nutch crawler.
Fetcher2 itself is a MapRunnable job (i.e., a term from the Hadoop
vocabulary). In my mind, the crawler process needs to be a separate
subsystem in Nutch, independent of the underlying middleware substrate
(kind of like I'm suggesting above). As an example: how would we take
the existing Nutch Fetcher2 and run it over JMS? Or XML-RPC? Or RMI?
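
To sketch the kind of separation I have in mind (none of this is
existing Nutch code, and all the names are hypothetical): the crawl
logic would be written against small interfaces, and a thin adapter per
substrate -- a Hadoop MapRunnable, a JMS consumer, an XML-RPC endpoint --
would drive it.

  /** Crawl logic, with no knowledge of the middleware driving it. */
  interface CrawlTask {
    void fetch(String url, ContentSink sink) throws Exception;
  }

  /** Where fetched content goes; again middleware-neutral. */
  interface ContentSink {
    void store(String url, byte[] content, String contentType) throws Exception;
  }

  /** One possible adapter; a JMS or RMI adapter would wrap the same task. */
  class HadoopCrawlAdapter {
    private final CrawlTask task;
    HadoopCrawlAdapter(CrawlTask task) { this.task = task; }
    // Inside a MapRunnable.run(...), each input URL would be handed to
    // task.fetch(url, sink).
  }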

So, I guess that's all I'm saying -- the Nutch 2.0 architecture should be
clearly insulated from the underlying middleware technology. That's my main
concern moving forward.

Hope that helps to explain my point of view. :) If not, let me know and I
would be happy to chat more about it. Thanks!

Cheers,
 Chris



______________________________________________
Chris Mattmann, Ph.D.
[hidden email]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
Phone:  818-354-8810
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Next Generation Nutch

Sami Siren-2
In reply to this post by Dennis Kubes-2
Dennis Kubes wrote:

Great points, Dennis, and I have to say that I agree with most of them.
I'd like to add that Nutch should not try to do everything by itself but
concentrate on its core functionality (whatever that will eventually
be); let's open our eyes and see what already exists out there. For
example, we could use something like Tika for parsing (through some
_abstraction_) instead of maintaining our own set of parsers, use
something like OpenPipe as a document processing (analysis) pipeline,
etc.

(some more comments inline)

> I have been thinking about a next generation Nutch for a while now,
> had some talks with some of the other committers, and have gotten
> around to putting some thoughts / requirements down on paper.  I
> wanted to run these by the community and get feedback.  This message
> will be a bit long so please bear with me.
>
> First let me define that I think that the purpose of Nutch is to be a
> web search engine.  When I say that I mean to specifically exclude
> enterprise search.  By web search I am talking about general or
> vertical search engines in the 1M-20B document range.  I am excluding
> things such as database centric search and possibly even local
> filesystem search. IMO Solr is a very capable enterprise search
> product and could handle local filesystem search (if it doesn't
> already) and Nutch shouldn't try to overlap functionality.  I think it
> should be able to interact, maybe share indexes yes, but not overlap
> purpose.  I think that Nutch should be designed to handle large
> datasets, meaning it has the ability to scale to billions, perhaps 10s
> of billions of pages.  Hadoop already gives us this capability for
> processing but Nutch would need to improve on the search server and
> shard management side  of things to be able to scale to the billion
> page level.  So the next generation of Nutch I think should focus on
> web scale search.
So from a protocol perspective this means HTTP (and perhaps HTTPS)?

At larger scale, performance will also play a more important role than
before, as so far Nutch has (IMO) mostly been about functionality.

>
> After working with Hadoop and MapReduce for the last couple of years I
> find it interesting just how similar development of MapReduce programs
> seem to be to the linux/unix philosophy of small programs chained
> together to accomplish big things.  So going forward I see this as a
> healthy overall general architecture.  Nutch would have many small
> tools that would be linked through data structures.  We already do
> this to some extent in the current version of Nutch, an example of
> which would be the tools that generate and act on CrawlDatum objects
> (CrawlDb, UpdateDb, etc.).  I would like to keep that idea of tools
> and data structures, with the tools chained together perhaps only by
> shell or management scripts, in different pipelines acting on the data
> structures.  When I say data structure I don't mean binary map or
> sequence files.  These may be a standard way to store these objects
> but Hadoop allows any input / output formats whether that be to HBase,
> a relational database, or a local filesystem.  I think we should be open
> to have those data structures stored however is best for the user
> through different hadoop formats.  So a general overall architecture
> of tools and data structures and pipelines of these tools.
>
> I currently see five or six distinct phases to a web search engine.
> They are;  Acquire, Parse, Analyze, Index, Search, and Shard
> Management.  Ok shard management might not be so much a phase as a
> functionality. Acquire is simply the acquisition of the document be it
> PDF, HTML, or images.  This would usually be the crawler phase.  Parse
> is parsing that content into useful and standard data structures.  I
> do believe that parsing should be separate and distinct from
> crawling.  If you crawl 50% of 5M pages and the crawler dies, you
> should still be able to use that 50% you crawled.  Analyze is what we
> do with the content once it is parsed into a standard structure we can
> use.  This could be anything from a better link analysis to natural
> language processing, language identification, and machine learning.  
> The analysis phase should probably have an ever expanding set of tools
> for different purposes. These tools would create specialized data
> structures of their own. Eventually through all the analysis we come
> up with a score for a given piece of content.  That could be a
> document or a field.  Indexing is the process of taking the analysis
> scores and content and creating the indexes for searching.  Searching
> is concerned with the searching of the indexes.  This should be doable
> from command line, web based, or other ways.  Shard management is
> concerned with the deployment and management of large number of indexes.
We should also see if distributed Solr (as Otis noted) or Hadoop's
distributed Lucene indexing are good enough to start with.

>
> I think the next generation of nutch should allow the changing of
> different tools in any of these areas.  What this means is the ability
> to have different components such as web crawlers (as long as the end
> data structure is the same), for example Fetcher, Fetcher2, Grub,
> Heritrix, or even specialized crawlers.  And different components for
> different analysis types.  I don't see a lot of cross-cutting concerns
> here.  And where there is, url normalization for example, I think it
> can be handled better through dependency injection.
>
> Which brings me to three.  I think it is time to get rid of the plugin
> framework.
+1
> I want to keep the functionality of the various plugins but I think a
> dependency injection framework, such as spring, creating the
> components needed for logic inside of tools is a much cleaner way to
> proceed.  This would allow much better unit and mock testing of tool
> and logic functionality.  
The lack of JUnit tests in Nutch has been a big burden for it (in
general, the amount of JUnit tests seems to somewhat correlate with how
easy or hard they are to write :), so if we architect the system to be
easily testable (small, isolated units) we could simultaneously raise
the bar for JUnit testing it and also make it easier to refactor later.

> It would allow Nutch to run on a non "nutchified" Hadoop cluster,
> meaning just a plain old hadoop cluster.  We could have core jars and
> contrib jars and a contrib directory which is pulled from by shell
> scripts when submitting jobs to Hadoop.  With the multiple-resources
> functionality in Hadoop it would be a simple matter of creating the
> correct command lines for the job to run.
>
> And that brings me to separation of data and presentation.  Currently
> the Nutch website is one monolithic jsp application with plugins.  I
> think the next generation should segment that out into xml / json
> feeds and a separate front end that uses those feeds.  Again this
> would make it much easier to create web applications using nutch.
>
> And of course I think that shard management, a la Hadoop master and
> slave style, is a big requirement as well.  I also think a full test
> suite with mock objects and local and MiniMR and MiniDFS cluster
> testing is important as is better documentation and tutorials (maybe
> even a book :)).  So up to this point I have created MapReduce jobs
> that use spring for dependency injection and it is simple and works
> well.  The above is the direction I would like to head down but I
> would also like to see what everyone else is thinking.
>
> Dennis
>

--
 Sami Siren


Re: Next Generation Nutch

wuqi-2
In reply to this post by Dennis Kubes-2
I am a frequent Nutch user, and I am now building a travel vertical search engine. First of all, I would like to thank the Nutch community, and especially the Nutch committers, for providing us with such a good product. Though I had zero experience with search engines before, Nutch helped me build a search engine from scratch in a very short time. But recently I have found that development on Nutch has been INACTIVE for a fairly long time, and I have also been considering turning to Solr. I am really glad to see this email; it gives me confidence in Nutch again, and I want to contribute some ideas. I am Chinese, so please forgive my poor English.

1. Who is using Nutch? What should Nutch look like?
From the nutch-user list, I find that most Nutch users are the same as me: with limited experience and knowledge of search engines, wanting to build a search engine very quickly, and without much concern for the flexibility of Nutch development. This is very similar to Tomcat users; nowadays there are seldom people who modify Tomcat code -- most users just download, install, and run it.
I hope that Nutch will stick to Lucene and Hadoop; the possibility of using other crawlers like Heritrix might not be that important. I have also developed some Nutch plugins, and the Nutch plugin system satisfies me well.
I want Nutch to provide a high-performance, stable, highly scalable search platform. Based on this platform, I could easily modify some important parts such as indexing and search ranking, or, if I need to modify some code in Lucene, the modification could easily be integrated with Nutch.

2. The difference between Nutch and Solr?
Solr for enterprise search, Nutch for internet search -- we all hope for this. I think using Nutch to build an internet search engine is much easier than using Solr. But I also find there are many (maybe even more) internet-based search engines now using Solr, and I am also considering using Solr and using Nutch just as a crawler.
Solr for enterprise and Nutch for the internet makes sense to me; I think there are some big shortcomings in Nutch that drive many people to use Solr.

3. Why I am considering Solr.
My biggest concern with Nutch is that I don't have confidence in its search scalability, distributed indexing, and online index updates. I would be glad if anyone here could share their successful experience or provide some hints on how to implement this based on Nutch.


4. How to speed up the development of Nutch?
As stated at the beginning of this mail, Nutch development is very inactive now. I hope the committers will be more active and take the lead. We will have a full-time person working on Nutch this year, and I also hope we can contribute more to Nutch.

I seldom write a long email in English like this. I hope my opinions help.

Thanks
-Qi




Re: Next Generation Nutch

Dennis Kubes-2
In reply to this post by chrismattmann


Chris Mattmann wrote:

> Hi Dennis,
>
> Thanks for putting this together. I think that it's also important to add to
> this list the ability to cleanly separate out the following major
> components:
>
> 1. The underlying distributed computing infrastructure (e.g., why does it
> have to be Hadoop? Shouldn't Nutch be able to run on top of JMS, or XML-RPC,
> or what about even grid computing technologies, and web services? Hadoop can
> certainly be _the_ core implementation of the underlying substrate, but the
> ability to change this out should be a lot easier than it currently is. Read
> on below to see what I mean.)
>
> 2. The crawler. Right now I think it's much too tied to the underlying
> orchestration process and infrastructure.

I agree. With distinct components standardizing on common data structures,
though, it should be possible to have any type of fetcher a person wants
(even a custom one), as long as it outputs correctly or there is a tool
to convert its output to the standard one.

I will say that I think parsing should be a separate job from fetching,
and MapReduce may not be the best way to do fetching.  It may be, it may
not; we should be open to that possibility.
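
To make the idea of a common output data structure concrete, here is a minimal Java sketch of what a shared fetch record and a pluggable fetcher contract could look like. The names (FetchedContent, Fetcher) and fields are invented for illustration and are not part of the current Nutch codebase; any real design would need to agree on the actual fields.

// Hypothetical sketch only: any crawler (Fetcher, a Heritrix adapter, or a
// custom one) would standardize on emitting this one record type.
import java.util.Map;

public final class FetchedContent {
    private final String url;                   // canonical URL of the document
    private final long fetchTime;               // epoch millis when it was fetched
    private final String contentType;           // e.g. "text/html", "application/pdf"
    private final byte[] rawContent;            // unparsed bytes as downloaded
    private final Map<String, String> headers;  // protocol headers, if any

    public FetchedContent(String url, long fetchTime, String contentType,
                          byte[] rawContent, Map<String, String> headers) {
        this.url = url;
        this.fetchTime = fetchTime;
        this.contentType = contentType;
        this.rawContent = rawContent;
        this.headers = headers;
    }

    public String getUrl()                  { return url; }
    public long getFetchTime()              { return fetchTime; }
    public String getContentType()          { return contentType; }
    public byte[] getRawContent()           { return rawContent; }
    public Map<String, String> getHeaders() { return headers; }
}

interface Fetcher {
    // A conversion tool for a third-party crawler only has to produce the same records.
    Iterable<FetchedContent> fetch(Iterable<String> urls) throws Exception;
}

A separate parse step could then read these records regardless of which crawler produced them, which is exactly what keeps a half-finished crawl usable.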

>
> 3. The data structures. You do mention this below, but I would add to it
> that the data structures for Nutch should be simple POJOs and not have any
> tie to the underlying infrastructure (e.g., no need for Writeable methods,
> etc.)

I would love to do that if it is possible.  Question is how would we
convert POJOs to/from the writable format needed by hadoop?  I know
hadoop is working on serialization frameworks for the future where you
can use just plain objects and not need to implement writable.  Don't
know where the progress is on that yet.
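
Purely as an illustration of one possible bridge (not an existing Hadoop or Nutch facility), a thin wrapper could keep the domain objects as plain Serializable POJOs and confine the Writable plumbing to a single generic class. PojoWritable is an invented name, and plain Java serialization here is just a stand-in for whatever serialization framework Hadoop ends up providing.

// Hypothetical sketch: wrap any Serializable POJO in a Hadoop Writable so the
// domain classes themselves stay free of Hadoop types.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import org.apache.hadoop.io.Writable;

public class PojoWritable<T extends Serializable> implements Writable {
    private T value;

    public PojoWritable() { }                     // no-arg constructor required by Hadoop
    public PojoWritable(T value) { this.value = value; }

    public T get() { return value; }

    public void write(DataOutput out) throws IOException {
        // Serialize the wrapped POJO to a length-prefixed byte array.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bytes);
        oos.writeObject(value);
        oos.flush();
        byte[] data = bytes.toByteArray();
        out.writeInt(data.length);
        out.write(data);
    }

    @SuppressWarnings("unchecked")
    public void readFields(DataInput in) throws IOException {
        byte[] data = new byte[in.readInt()];
        in.readFully(data);
        try {
            value = (T) new ObjectInputStream(new ByteArrayInputStream(data)).readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException("Cannot deserialize wrapped POJO: " + e);
        }
    }
}

Java serialization is slow and verbose, so this only sketches the shape of the bridge; the point is that the POJOs themselves never see DataInput or DataOutput.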

Dennis

>
> I think that with these types of guiding principles above, along with what
> you mention below, there is the potential here to generate a really
> flexible, reusable architecture, that, when folks come along and mention,
> "Oh I've written Crawler XXX, how do I integrate it into Nutch", we don't
> have to come back and say that the entire system has to be changed; or even
> worse, that it cannot be done at all.
>
> My 2 cents,
>  Chris
>  
>
>
> ______________________________________________
> Chris Mattmann, Ph.D.
> [hidden email]
> Cognizant Development Engineer
> Early Detection Research Network Project
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                     Mailstop:  171-246
> _______________________________________________________
>
> Disclaimer:  The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.
>
>

Re: Next Generation Nutch

Dennis Kubes-2
In reply to this post by Sami Siren-2


Sami Siren wrote:
> Dennis Kubes wrote:
>
> great points Dennis and i have to say that I agree with most of them.
> I'd like to add that nutch should not try to do all by itself but
> concentrate on it's core functionality (what ever it will eventually
> be), let's open our eyes and see what already exists out there. For
> example we could use something like Tika for parsing (through some
> _abstraction_) instead of maintaining our own set of parsers, use
> something like openpipe as doc processing(analysing) pipeline etc.

I completely agree.

>
> (some more comments inline)
>
>> I have been thinking about a next generation Nutch for a while now,
>> had some talks with some of the other committers, and have gotten
>> around to putting some thoughts / requirements down on paper.  I
>> wanted to run these by the community and get feedback.  This message
>> will be a bit long so please bear with me.
>>
>> First let me define that I think that the purpose of Nutch is to be a
>> web search engine.  When I say that I mean to specifically exclude
>> enterprise search.  By web search I am talking about general or
>> vertical search engines in the 1M-20B document range.  I am excluding
>> things such as database centric search and possibly even local
>> filesystem search. IMO Solr is a very capable enterprise search
>> product and could handle local filesystem search (if it doesn't
>> already) and Nutch shouldn't try to overlap functionality.  I think it
>> should be able to interact, maybe share indexes yes, but not overlap
>> purpose.  I think that Nutch should be designed to handle large
>> datasets, meaning it has the ability to scale to billions, perhaps 10s
>> of billions of pages.  Hadoop already gives us this capability for
>> processing but Nutch would need to improve on the search server and
>> shard management side  of things to be able to scale to the billion
>> page level.  So the next generation of Nutch I think should focus on
>> web scale search.
> so from protocol perspective this means http (and perhaps https) ?

I think it should have the ability to support other protocols; who knows,
maybe somebody wants to start the largest gopher search engine ;).  So
I don't think we should limit ourselves, but I do like focusing.
>
> In large(r) scale the performance will also be playing more important
> role than before as so far Nutch has mostly (IMO) been about functionality.

Absolutely.

>
>>
>> After working with Hadoop and MapReduce for the last couple of years I
>> find it interesting just how similar development of MapReduce programs
>> seem to be to the linux/unix philosophy of small programs chained
>> together to accomplish big things.  So going forward I see this as a
>> healthy overall general architecture.  Nutch would have many small
>> tools that would be linked through data structures.  We already do
>> this to some extent in the current version of Nutch, an example of
>> which would be the tools that generate and act on CrawlDatum objects
>> (CrawlDb, UpdateDb, etc.).  I would like to keep that idea of tools
>> and data structures wth the tools are chained together perhaps only by
>> shell or management scripts, in different pipelines acting on the data
>> structures.  When I say data structure I don't mean binary map or
>> sequence files.  These may be a standard way to store these objects
>> but Hadoop allows any input / output formats whether that be to HBase,
>> a relational database, a local filesytem.  I think we should be open
>> to have those data structures stored however is best for the user
>> through different hadoop formats.  So a general overall architecture
>> of tools and data structures and pipelines of these tools.
>>
>> I currently see five or six distinct phases to a web search engine.
>> They are;  Acquire, Parse, Analyze, Index, Search, and Shard
>> Management.  Ok shard management might not be so much a phase as a
>> functionality. Acquire is simply the acquisition of the document be it
>> PDF, HTML, or images.  This would usually be the crawler phase.  Parse
>> is parsing that content into useful and standard data structures.  I
>> do believe that parsing should be separate and distinct from
>> crawling.  If you crawl 50% of 5M pages and the crawler dies, you
>> should still be able to use that 50% you crawled.  Analyze is what we
>> do with the content once it is parsed into a standard structure we can
>> use.  This could be anything from a better link analysis to natural
>> language processing, language identification, and machine learning.  
>> The analysis phase should probably have an ever expanding set of tools
>> for different purposes. These tools would create specialized data
>> structures of their own. Eventually through all the analysis we come
>> up with a score for a given piece of content.  That could be a
>> document or a field.  Indexing is the process of taking the analysis
>> scores and content and creating the indexes for searching.  Searching
>> is concerned with the searching of the indexes.  This should be doable
>> from command line, web based, or other ways.  Shard management is
>> concerned with the deployment and management of large number of indexes.
> We should also see if distributed solr (as otis noted) / hadoops
> distributed lucene indexing are good enough to start with.
>>
>> I think the next generation of nutch should allow the changing of
>> different tools in any of these areas.  What this means is the ability
>> to have different components such as web crawlers (as long as the end
>> data structure is the same), for example Fetcher, Fetcher2, Grub,
>> Heretrix, or even specialized crawlers.  And different components for
>> different analysis types.  I don't see a lot of cross-cutting concerns
>> here.  And where there is, url normalization for example, I think it
>> can be handled better through dependency injection.
>>
>> Which brings me to three.  I think it is time to get rid of the plugin
>> framework.
> +1
>> I want to keep the functionality of the various plugins but I think a
>> dependency injection framework, such as spring, creating the
>> components needed for logic inside of tools is a much cleaner way to
>> proceed.  This would allow much better unit and mock testing of tool
>> and logic functionality.  
> The lack of junit tests in nutch has been a big burden for it (in
> general amount of junit tests seems to somewhat correlate to how
> easy/hard they are to write :) so if we architecture the system to be
> easily testable (small isolated units) we could simultaneously rise the
> bar for junit testing it and also make it easier to refactor later.

That is why I am leaning toward a DI framework.  If not Spring, then
something that allows complete unit tests and mock objects.
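
As a rough, framework-agnostic sketch of what replacing plugin lookup with injected components might look like (all class names here are hypothetical; Spring, or any other container, would simply do the wiring instead of the hand-written main below):

// Hypothetical sketch: the tool receives its components through its
// constructor instead of discovering them via the plugin framework.
import java.util.Arrays;
import java.util.List;

interface UrlNormalizer {
    String normalize(String url);
}

class LowercaseNormalizer implements UrlNormalizer {
    public String normalize(String url) {
        return url.toLowerCase();   // toy behavior, just for the example
    }
}

class UpdateDbTool {
    private final List<UrlNormalizer> normalizers;

    // Whoever builds the tool (Spring context, test, shell driver) decides
    // which normalizers exist; the tool never looks anything up itself.
    UpdateDbTool(List<UrlNormalizer> normalizers) {
        this.normalizers = normalizers;
    }

    String normalize(String url) {
        for (UrlNormalizer n : normalizers) {
            url = n.normalize(url);
        }
        return url;
    }
}

public class WiringExample {
    public static void main(String[] args) {
        UpdateDbTool tool = new UpdateDbTool(
            Arrays.<UrlNormalizer>asList(new LowercaseNormalizer()));
        System.out.println(tool.normalize("HTTP://Example.COM/Index.html"));
    }
}

A unit test injects a mock UrlNormalizer the same way, which is the testability win being discussed.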

Dennis

>
>> It would allow Nutch to run on a non "nutchified" Hadoop cluster,
>> meaning just a plain old hadoop cluster.  We could have core jars and
>> contrib jars and a contrib directory which is pulled from by shell
>> scripts when submitting jobs to Hadoop.  With the multiple-resources
>> functionality in Hadoop it would be a simple matter of creating the
>> correct command lines for the job to run.
>>
>> And that brings me to separation of data and presentation.  Currently
>> the Nutch website is one monolithic jsp application with plugins.  I
>> think the next generation should segment that out into xml / json
>> feeds and a separate front end that uses those feeds.  Again this
>> would make it much easier to create web applications using nutch.
>>
>> And of course I think that shard management, a la Hadoop master and
>> slave style, is a big requirement as well.  I also think a full test
>> suite with mock objects and local and MiniMR and MiniDFS cluster
>> testing is important as is better documentation and tutorials (maybe
>> even a book :)).  So up to this point I have created MapReduce jobs
>> that use spring for dependency injection and it is simple and works
>> well.  The above is the direction I would like to head down but I
>> would also like to see what everyone else is thinking.
>>
>> Dennis
>>
>
> --
> Sami Siren
>

Re: Next Generation Nutch

Dennis Kubes-2
In reply to this post by Otis Gospodnetic-2


Otis Gospodnetic wrote:
> Hello,
>
> A few quick comments.  I don't know how much you track Solr, but the mention of shards makes me think of SOLR-303 and DistributedSearch page on Solr Wiki.  You'll want to check those out.  In short, Solr has the notion of shards and distributed search, kind of like Nutch with its RPC framework and searchers.  *That* is one big duplication of work, IMHO.  As far as the indexing+searching+shards go, I think one direction worth looking at carefully would be the gentle Nutch->Solr relationship -- using Solr to do indexing and searching.  Shard management doesn't exist in either project yet, but I think it would be ideal to come up with a common management mechanism, if possible.
>

In thinking about a new nutch I always thought that shard management is
absolutely necessary but it never felt right in terms of where it
belongs.  If we are saying that nutch is small tools strung together to
produce different types of search indexes, shard management isn't really
a tool.  It is more of something that is needed afterwards.  And yes both
Nutch and Solr as well as other people using lucene indexes need some
type of distributed index management system and I don't want to
duplicate this work.  Perhaps this is a good proposal for a separate
lucene sub-project.  Hey we could even call it shard ;)
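
To make that proposal slightly more concrete, here is a minimal sketch of the kind of operations such a shared shard-management component might expose. None of this exists in Nutch or Solr; the interface and method names are invented purely to frame the requirements discussion.

// Hypothetical sketch of a shared shard-management API; all names invented.
import java.util.List;

interface ShardManager {

    // Register a newly built shard (index plus companion data) so search
    // servers can start serving it.
    void deploy(String shardId, String sourcePath) throws Exception;

    // List the shards currently live, e.g. so a front end can fan out queries.
    List<String> liveShards();

    // Move a shard to another search server, e.g. for rebalancing or recovery.
    void migrate(String shardId, String targetServer) throws Exception;

    // Retire a shard superseded by a newer index generation.
    void decommission(String shardId) throws Exception;
}

Whether this lives in Nutch, Solr, or a separate sub-project, the requirements (deploy, replicate, rebalance, retire) look much the same.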

Dennis


> I think this addresses your "... but Nutch would need to improve
> on the search server and shard management side  of things to be able to
> scale to the billion page level.  So the next generation of Nutch I
> think should focus on web scale search." statement.
>
> I know of a well-known, large corporation evaluating Solr (and its dist. search in particular) to handle 1-2B docs and 100 QPS.
>
> I don't fully follow the part about getting rid of plugins, spring, etc., so I can't comment.
>
> Regarding the webapp - perhaps Solr and SolrJ could be used here.  Solr itself is a webapp, and it contains various ResponseWriters that can output XML, JSON, pure Ruby, Python, even binary responses (in JIRA).
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>

Re: Next Generation Nutch

Dennis Kubes-2
In reply to this post by Dennis Kubes-2
I have put up a wiki page for discussions about a new nutch:

http://wiki.apache.org/nutch/Nutch2Architecture

I would like to discuss this some more and perhaps come up with some
basic tools to prove out concepts for a new architecture if nobody objects.

Dennis


Re: Next Generation Nutch

Otis Gospodnetic-2
In reply to this post by Dennis Kubes-2
I suppose the first thing to do would be describe the requirements for this shard management.  I imagine you have very specific functionality in mind from your Wikia Search experience.  Mind putting your ideas on the Wiki?  I think it would be very good to share this with solr-dev@lucene early on, so we can come up with something general that fits both Nutch and Solr.  It might turn out that this calls for a separate Lucene project, but we'll see that once the real discussion starts.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Next Generation Nutch

Dennis Kubes-2


Otis Gospodnetic wrote:
> I suppose the first thing to do would be describe the requirements for this shard management.  I imagine you have very specific functionality in mind from your Wikia Search experience.  Mind putting your ideas on the Wiki?  I think it would be very good to share this with solr-dev@lucene early on, so we can come up with something general that fits both Nutch and Solr.  It might turn out that this calls for a separate Lucene project, but we'll see that once the real discussion starts.
>

I completely agree.  This would be better as a shared project.  I will
put my current thoughts down on the Nutch wiki, unless there is already
a discussion going somewhere?

Dennis

> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Dennis Kubes <[hidden email]>
> To: [hidden email]
> Sent: Sunday, April 13, 2008 5:44:32 PM
> Subject: Re: Next Generation Nutch
>
>
>
> Otis Gospodnetic wrote:
>> Hello,
>>
>> A few quick comments.  I don't know how much you track Solr, but the mention of shards makes me think of SOLR-303 and DistributedSearch page on Solr Wiki.  You'll want to check those out.  In short, Solr has the notion of shards and distributed search, kind of like Nutch with its RPC framework and searchers.  *That* is one big duplication of work, IMHO.  As far as the indexing+searching+shards go, I think one direction worth looking at carefully would be the gentle Nutch->Solr relationship -- using Solr to do indexing and searching.  Shard management doesn't exist in either project yet, but I think it would be ideal to come up with a common management mechanism, if possible.
>>
>
> In thinking about a new nutch I always thought that shard management is
> absolutely necessary but it never felt right in terms of where it
> belongs.  If we are saying that nutch is small tools strung together to
> produce different types of search indexes, shard management isn't really
> a tool.  It is more of something after that is needed.  And yes both
> Nutch and Solr as well as other people using lucene indexes need some
> type of distributed index management system and I don't want to
> duplicate this work.  Perhaps this is a good proposal for a separate
> lucene sub-project.  Hey we could even call it shard ;)
>
> Dennis
>
>
>> I think this addresses your "... but Nutch would need to improve
>> on the search server and shard management side  of things to be able to
>> scale to the billion page level.  So the next generation of Nutch I
>> think should focus on web scale search." statement.
>>
>> I know of a well-known, large corporation evaluating Solr (and its dist. search in particular) to handle 1-2B docs and 100 QPS.
>>
>> I don't fully follow the part about getting rid of plugins, spring, etc., so I can't comment.
>>
>> Regarding the webapp - perhaps Solr and SolrJ could be used here.  Solr itself is a webapp, and it contains various ResponseWriters that can output XML, JSON, pure Ruby, Python, even binary responses (in JIRA).
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>> ----- Original Message ----
>> From: Dennis Kubes <[hidden email]>
>> To: [hidden email]
>> Sent: Friday, April 11, 2008 5:59:41 PM
>> Subject: Next Generation Nutch
>>
>> I have been thinking about a next generation Nutch for a while now, had
>> some talks with some of the other committers, and have gotten around to
>> putting some thoughts / requirements down on paper.  I wanted to run
>> these by the community and get feedback.  This message will be a bit
>> long so please bear with me.
>>
>> First let me define that I think that the purpose of Nutch is to be a
>> web search engine.  When I say that I mean to specifically exclude
>> enterprise search.  By web search I am talking about general or vertical
>> search engines in the 1M-20B document range.  I am excluding things such
>> as database centric search and possibly even local filesystem search.
>> IMO Solr is a very capable enterprise search product and could handle
>> local filesystem search (if it doesn't already) and Nutch shouldn't try
>> to overlap functionality.  I think it should be able to interact, maybe
>> share indexes yes, but not overlap purpose.  I think that Nutch should
>> be designed to handle large datasets, meaning it has the ability to
>> scale to billions, perhaps 10s of billions of pages.  Hadoop already
>> gives us this capability for processing but Nutch would need to improve
>> on the search server and shard management side  of things to be able to
>> scale to the billion page level.  So the next generation of Nutch I
>> think should focus on web scale search.
>>
>> After working with Hadoop and MapReduce for the last couple of years I
>> find it interesting just how similar development of MapReduce programs
>> seem to be to the linux/unix philosophy of small programs chained
>> together to accomplish big things.  So going forward I see this as a
>> healthy overall general architecture.  Nutch would have many small tools
>> that would be linked through data structures.  We already do this to
>> some extent in the current version of Nutch, an example of which would
>> be the tools that generate and act on CrawlDatum objects (CrawlDb,
>> UpdateDb, etc.).  I would like to keep that idea of tools and data
>> structures wth the tools are chained together perhaps only by shell or
>> management scripts, in different pipelines acting on the data
>> structures.  When I say data structure I don't mean binary map or
>> sequence files.  These may be a standard way to store these objects but
>> Hadoop allows any input / output formats whether that be to HBase, a
>> relational database, a local filesytem.  I think we should be open to
>> have those data structures stored however is best for the user through
>> different hadoop formats.  So a general overall architecture of tools
>> and data structures and pipelines of these tools.
>>
>> I currently see five or six distinct phases to a web search engine.
>> They are;  Acquire, Parse, Analyze, Index, Search, and Shard Management.
>>   Ok shard management might not be so much a phase as a functionality.
>> Acquire is simply the acquisition of the document be it PDF, HTML, or
>> images.  This would usually be the crawler phase.  Parse is parsing that
>> content into useful and standard data structures.  I do believe that
>> parsing should be separate and distinct from crawling.  If you crawl 50%
>> of 5M pages and the crawler dies, you should still be able to use that
>> 50% you crawled.  Analyze is what we do with the content once it is
>> parsed into a standard structure we can use.  This could be anything
>> from a better link analysis to natural language processing, language
>> identification, and machine learning.  The analysis phase should
>> probably have an ever expanding set of tools for different purposes.
>> These tools would create specialized data structures of their own.
>> Eventually through all the analysis we come up with a score for a given
>> piece of content.  That could be a document or a field.  Indexing is the
>> process of taking the analysis scores and content and creating the
>> indexes for searching.  Searching is concerned with the searching of the
>> indexes.  This should be doable from command line, web based, or other
>> ways.  Shard management is concerned with the deployment and management
>> of large numbers of indexes.
>>
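A rough sketch of how those phase boundaries might look as plain interfaces.  None of these types exist in Nutch; they only illustrate each phase consuming the standard data structure produced by the phase before it:

// Hypothetical phase interfaces and data structures, for illustration only.
public class Phases {

  // Output of the acquire phase: raw bytes plus where they came from.
  public static class RawDocument {
    public String url;
    public byte[] content;      // PDF, HTML, image bytes, etc.
  }

  // Output of the parse phase: the standard structure later phases share.
  public static class ParsedDocument {
    public String url;
    public String text;         // extracted plain text
    public String[] outlinks;   // links found in the document
  }

  // Acquire: fetch a document.  A web crawler is one implementation.
  public interface Acquirer {
    RawDocument acquire(String url) throws Exception;
  }

  // Parse: runs independently of the crawler, so a half-finished crawl
  // is still usable.
  public interface Parser {
    ParsedDocument parse(RawDocument raw) throws Exception;
  }

  // Analyze: link analysis, language identification, and so on, each
  // eventually contributing to a score for a document or field.
  public interface Analyzer {
    float score(ParsedDocument doc) throws Exception;
  }
}
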
>> I think the next generation of Nutch should allow swapping out the
>> different tools in any of these areas.  What this means is the ability
>> to have different components such as web crawlers (as long as the end
>> data structure is the same), for example Fetcher, Fetcher2, Grub,
>> Heritrix, or even specialized crawlers.  And different components for
>> different analysis types.  I don't see a lot of cross-cutting concerns
>> here.  And where there are, URL normalization for example, I think they
>> can be handled better through dependency injection.
>>
>> Which brings me to my third point.  I think it is time to get rid of the
>> plugin framework.  I want to keep the functionality of the various
>> plugins, but I think a dependency injection framework, such as Spring,
>> creating the components needed for the logic inside of tools is a much
>> cleaner way to proceed.  This would allow much better unit and mock
>> testing of tool and logic functionality.  It would allow Nutch to run on
>> a non-"nutchified" Hadoop cluster, meaning just a plain old Hadoop
>> cluster.  We could have core jars and contrib jars and a contrib
>> directory which is pulled from by shell scripts when submitting jobs to
>> Hadoop.  With the multiple-resources functionality in Hadoop it would be
>> a simple matter of creating the correct command lines for the job to run.
>>
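A minimal sketch of the kind of wiring being described, assuming Spring's XML-based application context; the UrlNormalizer interface, the bean name, and the config file name are hypothetical:

import org.springframework.context.ApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

// Hypothetical example: a tool obtains its pluggable logic from a Spring
// context instead of the plugin framework.  In a unit test the same
// interface can be satisfied by a mock object.
public class DependencyInjectionExample {

  // Illustrative component interface, analogous to today's plugin points.
  public interface UrlNormalizer {
    String normalize(String url);
  }

  public static void main(String[] args) {
    // The wiring lives in a plain config shipped with the job jar, so the
    // tool can run on an ordinary, non-"nutchified" Hadoop cluster.
    ApplicationContext ctx =
        new ClassPathXmlApplicationContext("nutch-components.xml");  // hypothetical file
    UrlNormalizer normalizer = (UrlNormalizer) ctx.getBean("urlNormalizer");
    System.out.println(normalizer.normalize("HTTP://WWW.Example.COM/index.html"));
  }
}
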
>> And that brings me to separation of data and presentation.  Currently
>> the Nutch website is one monolithic JSP application with plugins.  I
>> think the next generation should segment that out into XML / JSON feeds
>> and a separate front end that uses those feeds.  Again this would make
>> it much easier to create web applications using Nutch.
>>
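A small sketch of what such a feed could look like, assuming a plain servlet front; the class name, request parameter, and JSON shape are illustrative only:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical JSON feed in front of the search servers, so any front end
// (JSP, PHP, JavaScript) can consume results without being part of Nutch.
public class SearchFeedServlet extends HttpServlet {

  protected void doGet(HttpServletRequest req, HttpServletResponse res)
      throws IOException {
    String query = req.getParameter("q");
    if (query == null) {
      query = "";
    }
    res.setContentType("application/json");
    PrintWriter out = res.getWriter();
    // In a real system the hits would come from the distributed searchers;
    // here we only show the shape of the feed.
    out.print("{\"query\":\"" + query + "\",\"hits\":[{\"url\":"
        + "\"http://example.com/\",\"title\":\"Example\",\"score\":1.0}]}");
  }
}
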
>> And of course I think that shard management, a la Hadoop master and
>> slave style, is a big requirement as well.  I also think a full test
>> suite with mock objects and local, MiniMR, and MiniDFS cluster testing
>> is important, as is better documentation and tutorials (maybe even a
>> book :)).  So far I have created MapReduce jobs that use Spring for
>> dependency injection, and it is simple and works well.  The above is
>> the direction I would like to head down, but I would also like to see
>> what everyone else is thinking.
>>
>> Dennis

Re: Next Generation Nutch

Andrzej Białecki-2
Dennis Kubes wrote:

>
>
> Otis Gospodnetic wrote:
>> I suppose the first thing to do would be describe the requirements for
>> this shard management.  I imagine you have very specific functionality
>> in mind from your Wikia Search experience.  Mind putting your ideas on
>> the Wiki?  I think it would be very good to share this with
>> solr-dev@lucene early on, so we can come up with something general
>> that fits both Nutch and Solr.  It might turn out that this calls for
>> a separate Lucene project, but we'll see that once the real discussion
>> starts.
>>
>
> I completely agree.  This would be better as a shared project.  I will
> put my current thoughts down on the Nutch wiki, unless there is already
> a discussion going somewhere?

There is a description of a related concept here:
http://wiki.apache.org/hadoop/DistributedLucene . However, this
addresses only the index part of the shard - in our case shards also
contain plain text (for summaries) and the original binary content (for
cached preview), and possibly other parts (NUTCH-466), none of which
are managed by this code.
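
A rough sketch of what a shard would then have to describe if it is managed as one unit; the class and field names below are purely illustrative:

// Hypothetical descriptor for a complete shard: not just the Lucene index,
// but also the plain text used for summaries and the original binary
// content used for cached preview.
public class ShardDescriptor {
  public String shardId;        // logical name, e.g. "shard-0007"
  public String indexPath;      // Lucene index directory
  public String plainTextPath;  // parsed text store, for summaries
  public String contentPath;    // original binary content, for cached view
  public long numDocs;          // documents in this shard
  public long version;          // bumped on each redeploy
}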


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Next Generation Nutch

Otis Gospodnetic-2
In reply to this post by Dennis Kubes-2
And there is http://wiki.apache.org/solr/DistributedSearch , but this talks *only* about search.

Dennis, are you the man to take what's on DistributedLucene and DistributedSearch and come up with a marriage proposal? :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Next Generation Nutch

Dennis Kubes-2
I can do that, but it will come after I finish some reqs on the next gen
Nutch. :)  I do consider shard management to be part of that.

Dennis


Re: Next Generation Nutch

Dennis Kubes-2
In reply to this post by Dennis Kubes-2
I have put down a rough draft architecture on the wiki at:

http://wiki.apache.org/nutch/Nutch2Architecture

Would love feedback and changes.

Dennis


Re: Next Generation Nutch

chrishane
In reply to this post by Dennis Kubes-2


Dennis Kubes wrote:

>>> Which brings me to my third point.  I think it is time to get rid of the
>>> plugin framework.
>> +1
>>> I want to keep the functionality of the various plugins but I think a
>>> dependency injection framework, such as Spring, creating the
>>> components needed for logic inside of tools is a much cleaner way to
>>> proceed.  This would allow much better unit and mock testing of tool
>>> and logic functionality.  
>> The lack of JUnit tests in Nutch has been a big burden for it (in
>> general the amount of JUnit tests seems to somewhat correlate to how
>> easy/hard they are to write :) so if we architect the system to be
>> easily testable (small isolated units) we could simultaneously raise
>> the bar for JUnit testing it and also make it easier to refactor later.
>
> That is why I am liking a DI framework.  If not spring then something
> where there can be complete unit tests and mock objects.
>
> Dennis

How about Guice (http://code.google.com/p/google-guice/)?  We have been
using it in our projects and like it very much.  The pros as I see them
would be:
  - quick understandability of the framework
  - compile-time type checking
  - very easy to change components (although a compile is required)
  - no XML configuration
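
A minimal sketch of that style of wiring, assuming Guice's standard module API; the UrlFilter interface and both implementations are made up for the example:

import com.google.inject.AbstractModule;
import com.google.inject.Guice;
import com.google.inject.Inject;
import com.google.inject.Injector;

// Hypothetical example of Guice-based wiring for a Nutch-style tool.
public class GuiceExample {

  public interface UrlFilter {
    boolean accept(String url);
  }

  public static class RegexUrlFilter implements UrlFilter {
    public boolean accept(String url) {
      return !url.endsWith(".jpg");   // toy rule, just for the example
    }
  }

  // A tool asks for its components in the constructor; Guice supplies them.
  public static class GenerateTool {
    private final UrlFilter filter;

    @Inject
    public GenerateTool(UrlFilter filter) {
      this.filter = filter;
    }

    public boolean shouldFetch(String url) {
      return filter.accept(url);
    }
  }

  // Bindings are plain, compile-checked Java -- no XML configuration.
  public static class NutchModule extends AbstractModule {
    protected void configure() {
      bind(UrlFilter.class).to(RegexUrlFilter.class);
    }
  }

  public static void main(String[] args) {
    Injector injector = Guice.createInjector(new NutchModule());
    GenerateTool tool = injector.getInstance(GenerateTool.class);
    System.out.println(tool.shouldFetch("http://example.com/page.html"));
  }
}

Since the bindings are ordinary Java, swapping RegexUrlFilter for another implementation is a one-line, compile-checked change.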

Anyway, just a thought.  We evaluated Spring before they released the
annotation version.  At the time it was just too heavyweight when we only
wanted a DI framework and not all the other stuff they include.  I'm not a
Spring expert, so take my knowledge of the product with a grain of salt.

Chris....

Re: Next Generation Nutch

Otis Gospodnetic-2-2
In reply to this post by Dennis Kubes-2
Thanks for the explanation, Chris.
As for a separate crawler, there is Droids.  I don't know its exact state, but I did see it had a couple of GSoC entries.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
From: Chris Mattmann <[hidden email]>
To: [hidden email]
Sent: Saturday, April 12, 2008 12:29:15 AM
Subject: Re: Next Generation Nutch

Hi Otis,

Thanks for your comments. My responses inline below:

>
> Hm, I have to say I'm not sure if I agree 100% with part 1.  I think it would
> be great to have such flexibility, but I wonder if trying to achieve it would
> be over-engineering.  Do people really need that?  I don't know, maybe!  If
> they do, then ignore my comment. :)

Well, in the past, at least in my experience, this is exactly what has paid
off for us: having the flexibility to architect a system that isn't tied to
the underlying technology. We once had a situation at JPL where a software
product was using CORBA for its underlying middleware implementation
framework. This (previously free) CORBA solution turned into a 30K/year
licensed solution, at the direction of the vendor, within a one-week timeframe.
Because we had architected and engineered our software system to be
independent of the underlying middleware substrate, we were able to switch
over to a free, Java-RMI based solution in the space of a weekend.

Of course, this is typically bound to certain classes of underlying
substrates, and middleware solutions (e.g., it would be difficult to switch
out certain middlewares with vastly different architectural styles, say, if
we were trying to switch from CORBA to a P2P based solution like JXTA), but
all I'm saying is that it would be great if we didn't have to dictate to a
potential Nutch 2.0 user that, to use our scalable, open source search engine
solution, they have to change from a JMS house to a Hadoop house. It would be
nice to say that we've architected Nutch 2.0 to be independent of the
underlying middleware provider. Of course, we can provide a default
implementation based on the existing Hadoop substrate, but we should provide
interfaces, data components, and architectural guidelines to be able to
change to, say, a Nutch solution over XML-RPC, or web services, or JMS,
without breaking the core architecture. Right now, I'm convinced that can't
be done, or in other words, it's too hard to tease the Hadoop notions out of
Nutch as it exists today.

>
> I'm curious about 2. - could you please explain a little what you mean by "too
> tied to the underlying
> orchestration process and infrastructure."?

What I mean by this is that the Fetcher/Fetcher2 dictates the orchestration
process for crawling: there is no separate, independent Nutch crawler.
Fetcher2 itself is a MapRunnable job (MapRunnable being a term from the Hadoop
vocabulary). In my mind, the crawler process needs to be a separate
subsystem in Nutch, independent of the underlying middleware substrate (kind
of like I'm suggesting above). As an example: how would we take the existing
Nutch Fetcher2, and run it over JMS? Or XML-RPC? Or RMI?

So, I guess that's all I'm saying -- the Nutch 2.0 architecture should be
clearly insulated from the underlying middleware technology. That's my main
concern moving forward.
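
A rough sketch of the separation being argued for: the crawl loop is written against small interfaces, and Hadoop (or JMS, XML-RPC, RMI) only shows up in adapters that implement them.  Everything below is hypothetical, not existing Nutch code:

public class MiddlewareIndependentCrawler {

  // What the crawler needs from the outside world, and nothing more.
  public interface FetchList {
    String nextUrl();                      // null when the list is exhausted
  }

  public interface Fetcher {
    byte[] fetch(String url) throws Exception;
  }

  public interface ContentStore {
    void store(String url, byte[] content) throws Exception;
  }

  // The orchestration itself knows nothing about MapRunnable, JobConf,
  // queues, or RPC -- those live behind the interfaces above.
  public static void crawl(FetchList list, Fetcher fetcher, ContentStore store) {
    String url;
    while ((url = list.nextUrl()) != null) {
      try {
        store.store(url, fetcher.fetch(url));
      } catch (Exception e) {
        // a real crawler would record the failure for a later retry
      }
    }
  }
}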

Hope that helps to explain my point of view. :) If not, let me know and I
would be happy to chat more about it. Thanks!

Cheers,
Chris


>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> From: Chris Mattmann <[hidden email]>
> To: [hidden email]
> Sent: Friday, April 11, 2008 9:10:30 PM
> Subject: Re: Next Generation Nutch
>
> Hi Dennis,
>
> Thanks for putting this together. I think that it's also important to add to
> this list the ability to cleanly separate out the following major
> components:
>
> 1. The underlying distributed computing infrastructure (e.g., why does it
> have to be Hadoop? Shouldn't Nutch be able to run on top of JMS, or XML-RPC,
> or what about even grid computing technologies, and web services? Hadoop can
> certainly be _the_ core implementation of the underlying substrate, but the
> ability to change this out should be a lot easier than it currently is. Read
> on below to see what I mean.)
>
> 2. The crawler. Right now I think it's much too tied to the underlying
> orchestration process and infrastructure.
>
> 3. The data structures. You do mention this below, but I would add to it
> that the data structures for Nutch should be simple POJOs and not have any
> tie to the underlying infrastructure (e.g., no need for Writable methods,
> etc.)
>
> I think that with these types of guiding principles above, along with what
> you mention below, there is the potential here to generate a really
> flexible, reusable architecture so that, when folks come along and mention,
> "Oh I've written Crawler XXX, how do I integrate it into Nutch", we don't
> have to come back and say that the entire system has to be changed; or even
> worse, that it cannot be done at all.
>
> My 2 cents,
>  Chris
>  
>
>
> ______________________________________________
> Chris Mattmann, Ph.D.
> [hidden email]
> Cognizant Development Engineer
> Early Detection Research Network Project
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                     Mailstop:  171-246
> _______________________________________________________
>
> Disclaimer:  The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.

______________________________________________
Chris Mattmann, Ph.D.
[hidden email]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
Phone:  818-354-8810
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.