Nutch near future - strategic directions


Nutch near future - strategic directions

Andrzej Białecki-2
Hi all,

ApacheCon is over and our 1.0 release has been out for some
time, so I think it's a good moment to discuss the next steps
in Nutch development.

Let me share with you the topics I identified and presented in the
ApacheCon slides, plus some topics worth discussing based on
various conversations I had there and on the discussions we had on our
mailing list:

1. Avoid duplication of effort
------------------------------
Currently we spend significant effort on implementing functionality that
other projects are dedicated to. Instead of doing the same work, and
sometimes poorly, we should concentrate on delegating and reusing:

* Use Tika for content parsing: this will require some effort and
collaboration with the Tika project, to improve Tika's ability to handle
more complex formats well (e.g. hierarchical compound documents such as
archives, mailboxes, RSS), and to contribute any missing parsers (e.g.
parse-swf).
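
To make this concrete, here is a minimal sketch of what delegating
parsing to Tika could look like, using Tika's AutoDetectParser (the
wrapper class and the way this would be wired into Nutch's parse
plugins are assumptions, not the actual integration):

import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaParseSketch {

  /** Parse an arbitrary document stream, letting Tika detect the format. */
  public static String parse(InputStream content, String url) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, url);           // hint for type detection

    parser.parse(content, handler, metadata, new ParseContext());

    System.out.println("Content-Type: " + metadata.get(Metadata.CONTENT_TYPE));
    return handler.toString();                                // extracted plain text
  }
}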

* Use Solr for indexing & search: it is hard to justify the effort of
developing and maintaining our own search server - Solr offers much more
functionality, configurability, performance and ease of integration than
our relatively primitive search server. Our integration with Solr needs
to be improved so that it's easier to set up and operate.
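
On the Solr side, the integration boils down to pushing documents to a
running Solr instance via the SolrJ client, roughly like this (a sketch
only - the URL and field names are made up, not Nutch's actual schema,
and the SolrJ class names are those of the 1.x client):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexSketch {

  public static void main(String[] args) throws Exception {
    // Point SolrJ at a running Solr instance (example URL).
    SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

    // Hypothetical fields; the real mapping would come from Nutch's indexing filters.
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "http://example.com/");
    doc.addField("title", "Example page");
    doc.addField("content", "Extracted plain text goes here");

    solr.add(doc);
    solr.commit();
  }
}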

* Use database-like storage abstraction: this may seem like a serious
departure from the current architecture, but I don't mean that we should
switch to an SQL DB ... what this means is that we should provide an
option to use HBase, as well as the current plain MapFile-s (and perhaps
other types of DBs, such as Berkeley DB or SQL, if it makes sense) as
our storage. There is a very promising initial port of Nutch to HBase,
which is currently closely integrated with the HBase API (which is both good
and bad) - it provides several improvements over our current storage, so
I think it's worth using as the new default, but let's see if we can
make it more abstract.
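
To illustrate what "more abstract" could mean, here is a purely
hypothetical sketch of such a storage layer - WebPage and WebPageStore
are invented names for illustration, not the API of the HBase port:

import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;

/** Hypothetical page record; the HBase port defines its own data model. */
class WebPage {
  String url;
  byte[] content;
  long fetchTime;
  float score;
}

/**
 * Hypothetical storage abstraction: the same interface could be backed
 * by HBase, plain MapFile-s, Berkeley DB or an SQL database.
 */
interface WebPageStore extends Closeable {
  WebPage get(String url) throws IOException;
  void put(WebPage page) throws IOException;
  void delete(String url) throws IOException;
  /** Scan a key range, e.g. to build fetchlists or run batch jobs. */
  Iterator<WebPage> scan(String startUrl, String endUrl) throws IOException;
}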

* Plugins: the initial OSGI port looks good, but I'm not sure yet
whether the benefits of OSGI outweigh the cost of this change ...

* Shard management: this is currently an Achilles' heel of Nutch, where
users are left on their own ... If we switch to using HBase then at
least on the crawling side the shard management will become much easier.
This still leaves the problem of deploying new content to search
server(s). The candidate framework for this side of the shard management
is Katta + patches provided by Ted Dunning (see ???). If we switch to
using Solr we would also have to use the Katta / Solr integration, and
perhaps Solr/Hadoop integration as well. This is a complex mix of
half-ready components that needs to be well thought-through ...

* Crawler Commons: during our Crawler MeetUp all representatives agreed
that we should collect a few components that are nearly the same across
all projects and collaborate on their development, and use them as an
external dependency. The candidate components are:

  - robots.txt parsing
  - URL filtering and normalization
  - page signature (fingerprint) implementations
  - page template detection & removal (aka. main content extraction)
  - possibly others, like URL redirection tracking, PageRank
calculation, crawler trap detection etc.
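
As an example of the kind of component that could be shared, here is a
rough sketch of URL normalization - the rules shown are illustrative
only, not the actual Nutch normalizers or anything agreed on at the
meetup:

import java.net.URL;

public class UrlNormalizerSketch {

  /** Apply a few common normalization rules; real normalizers are configurable. */
  public static String normalize(String urlString) throws Exception {
    URL url = new URL(urlString.trim());

    String host = url.getHost().toLowerCase();   // host names are case-insensitive
    int port = url.getPort();
    if (port == url.getDefaultPort()) {
      port = -1;                                 // drop explicit default ports, e.g. :80
    }
    String path = url.getPath().isEmpty() ? "/" : url.getPath();
    String query = url.getQuery();

    StringBuilder sb = new StringBuilder();
    sb.append(url.getProtocol().toLowerCase()).append("://").append(host);
    if (port != -1) sb.append(':').append(port);
    sb.append(path);
    if (query != null) sb.append('?').append(query);
    return sb.toString();                        // note: the fragment (#...) is dropped
  }

  public static void main(String[] args) throws Exception {
    // prints http://www.example.com/index.html
    System.out.println(normalize("HTTP://WWW.Example.COM:80/index.html#top"));
  }
}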

2. Make Nutch easier to use
---------------------------
This, as you may remember from our earlier discussions, raises the question:
who is the target audience of Nutch?

In my opinion, the main users of Nutch are vertical search engines, and
this is the audience that we should cater to. There are many reasons for
this:

- Nutch is too complex and too heavy for those that need to crawl up to
a few thousand pages. Now that the Droids project exists it's probably
not worth the effort to attempt a complete re-design of Nutch so that it
fits the needs of this group - Nutch is based on map-reduce, and it's not
likely we would want to change that, so there will always be
significant overhead for small jobs. I'm not saying we should not make
Nutch easier to use, but for small crawls Nutch is overkill. Also, in
many cases these users don't realize that they don't do any frontier
discovery and expansion, and what they really need is Solr.

- at the other end of the spectrum, there are very few companies
that want to do wide, web-scale crawling - this is costly, and
requires a solid business plan and serious funding. These users are
prepared anyway to spend significant effort on customizations and
problem-solving, or they may want to use only some parts of Nutch. Often
they are also not too eager to contribute back to the project - either
because of their proprietary nature or because their customizations are
not useful for a general audience.

The remaining group is interested in medium-size, high-quality crawling
(focused, with good spam & junk controls), which means either enterprise
search or vertical search. We should make Nutch an attractive platform
for such users, and we should discuss what this entails. Also, if we
refactor Nutch in the way I described above, it will be easier for such
users to contribute back to Nutch and other related projects.

3. Provide a platform for solving the really interesting issues
---------------------------------------------------------------
Nutch has many bits and pieces that implement really smart algorithms
and heuristics to solve difficult issues that occur in crawling. The
problem is that they are often well hidden and poorly documented, and
their interaction with the rest of the system is far from obvious.
Sometimes this is related to premature performance optimizations, in
other cases this is just a poorly abstracted design. Examples would
include the OPIC scoring, meta-tags & metadata handling, deduplication,
redirection handling, etc.

Even though these components are usually implemented as plugins, this
lack of transparency and poor design makes it difficult to experiment
with Nutch. I believe that improving this area will result in many more
users contributing back to the project, both from business and from
academia.

And there are quite a few interesting challenges to solve:

* crawl scheduling, i.e. determining the order and composition of
fetchlists to maximize the crawling speed.

* spam & junk detection (I won't go into details on this, there are tons
of literature on the subject)

* crawler trap handling (e.g. the classic calendar page that generates
an infinite number of pages).

* enterprise-specific ranking and scoring. This includes users' feedback
(explicit and implicit, e.g. click-throughs)

* pagelet-level crawling (e.g. portals, RSS feeds, discussion fora)

* near-duplicate detection, and the closely related issue of extracting
the main content from a templated page.

* URL aliasing (e.g. www.a.com == a.com == a.com/index.html ==
a.com/default.asp), and what happens with inlinks to such aliased pages.
Also related to this is the problem of temporary/permanent redirects and
complete mirrors.

Etc, etc ... I'm pretty sure there are many others. Let's make Nutch an
attractive platform to develop and experiment with such components.

-----------------
Briefly ;) that's what comes to my mind when I think about the future of
Nutch. I invite you all to share your thoughts and suggestions!

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch near future - strategic directions

Subhojit Roy
Hi,

Would it be possible to include in Nutch the ability to crawl & download a
page only if the page has been updated since the last crawl? I had read
some time back that there were plans to include such a feature. It would be a
very useful feature to have IMO. This of course depends on the "last
modified" timestamp being present on the webpage that is being crawled,
which I believe is not mandatory. Still, those who do set it would benefit.

Thanks,
-sroy

--
Subhojit Roy
Profound Technologies
(Search Solutions based on Open Source)
email: [hidden email]
http://www.profound.in

Re: Nutch near future - strategic directions

Andrzej Białecki-2
Subhojit Roy wrote:
> Hi,
>
> Would it be possible to include in Nutch the ability to crawl & download a
> page only if the page has been updated since the last crawl? I had read
> some time back that there were plans to include such a feature. It would be a
> very useful feature to have IMO. This of course depends on the "last
> modified" timestamp being present on the webpage that is being crawled,
> which I believe is not mandatory. Still, those who do set it would benefit.

This is already implemented - see the Signature / MD5Signature /
TextProfileSignature.
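
For reference, a minimal configuration sketch - this assumes the
db.signature.class property; check nutch-default.xml for the exact
property name and default in your release:

<!-- goes inside the <configuration> element of nutch-site.xml; chooses
     how page signatures are computed. TextProfileSignature is more
     tolerant of small HTML changes than MD5Signature. -->
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>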


--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch near future - strategic directions

dmcole
At 2:44 PM +0100 11/16/09, Andrzej Bialecki wrote:
>This is already implemented - see the Signature / MD5Signature /
>TextProfileSignature.

OK, then could somebody explain how to implement this feature? Does
the initial indexing require a special command-line? And does the
secondary indexing require a different command-line?

Thanks.

\dmc

--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
    David M. Cole                                            [hidden email]
    Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
    Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+

Re: Nutch near future - strategic directions

Sami Siren-2
In reply to this post by Andrzej Białecki-2
Lots of good thoughts and ideas, easy to agree with.

Something for the "ease of use" category:
-allow running on top of plain vanilla hadoop
-split into reusable components with nice and clean public api
-publish mvn artifacts so developers can directly use mvn, ivy etc to
pull required dependencies for their specific crawler
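
For example, a custom crawler project could then declare what it needs
in an ivy.xml roughly like this (the org/name/rev coordinates are
hypothetical - no such Nutch artifacts are published yet):

<ivy-module version="2.0">
  <info organisation="com.example" module="my-crawler"/>
  <dependencies>
    <!-- made-up coordinates, for illustration only -->
    <dependency org="org.apache.nutch" name="nutch-core" rev="2.0"/>
    <dependency org="org.apache.nutch" name="nutch-parse-tika" rev="2.0"/>
  </dependencies>
</ivy-module>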

My biggest concern is the execution of this (or any other) plan.
Some of the changes or improvements that have been proposed are quite
"heavy" in nature and would require large changes. I am just wondering
whether it would be better to take a fresh start instead of trying to
do this incrementally on top of the existing code base.

In the history of Nutch this approach is not something new (remember map
reduce?), and in my opinion it worked nicely then. Perhaps it is
different this time, since the changes we are discussing now have many
abstract things hanging in the air, even fundamental ones.

Of course the rewrite approach means that it will take some time before
we actually get to the point where we can start adding real substance
(meaning new features etc).

So to summarize, I would go ahead and put together a branch "nutch N.0"
that would consist of (a.k.a. my wish list - hope I am not being too
aggressive here):

-runs on top of plain hadoop
-use osgi (or some other more optimal extension mechanism that fits and
is easy to use)
-basic http/https crawling functionality (with "db abstraction" or hbase
directly and smart data structures that allow flexible and efficient
usage of the data)
-basic solr integration for indexing/search
-basic parsing with tika

After the basics are ok we would start adding and promoting any of the
hidden gems we might have, or some solutions for the interesting challenges.

PS. Many of the interesting challenges in your proposal seem to fall into
the category of "data analysis and manipulation" and are mostly used
after the data has been crawled, or between the fetch cycles, so many of
those could be implemented in the current code base as well. Somehow I just
feel that things could be made more efficient and understandable if the
foundation (e.g. data structures and extensibility) was in
better shape. Also, if written nicely, other projects could use them too!

--
  Sami Siren


Re: Nutch near future - strategic directions

Andrzej Białecki-2
Sami Siren wrote:
> Lots of good thoughts and ideas, easy to agree with.
>
> Something for the "ease of use" category:
> -allow running on top of plain vanilla hadoop

What does "plain vanilla" mean here? Do you mean the current DB
implementation? That's the idea - we should aim for an abstract layer
that can accommodate both HBase and plain MapFile-s.

> -split into reusable components with nice and clean public api
> -publish mvn artifacts so developers can directly use mvn, ivy etc to
> pull required dependencies for their specific crawler

+1, with slight preference towards ivy.

>
> My biggest concern is the execution of this (or any other) plan.
> Some of the changes or improvements that have been proposed are quite
> "heavy" in nature and would require large changes. I am just wondering
> whether it would be better to take a fresh start instead of trying to
> do this incrementally on top of the existing code base.

Well ... that's (almost) what Dogacan did with the HBase port. I agree
that we should not feel too constrained by the existing code base, but
it would be silly to throw everything away and start from scratch - we
need to find a middle ground. The crawler-commons and Tika projects
should help us to get rid of the ballast and significantly reduce the
size of our code.

> In the history of Nutch this approach is not something new (remember map
> reduce?) and in my opinion it worked nicely then. Perhaps it is
> different this time since the changes we are discussing now have many
> abstract things hanging in the air, even fundamental ones.

Nutch 0.7 to 0.8 reused a lot of the existing code.

>
> Of course the rewrite approach means that it will take some time before
> we actually get to the point where we can start adding real substance
> (meaning new features etc).
>
> So to summarize, I would go ahead and put together a branch "nutch N.0"
> that would consist of (a.k.a my wish list, hope I am not being too
> aggressive here):
>
> -runs on top of plain hadoop

See above - what do you mean by that?

> -use osgi (or some other more optimal extension mechanism that fits and
> is easy to use)
> -basic http/https crawling functionality (with "db abstraction" or hbase
> directly and smart data structures that allow flexible and efficient
> usage of the data)
> -basic solr integration for indexing/search
> -basic parsing with tika
>
> After the basics are ok we would start adding and promoting any of the
> hidden gems we might have, or some solutions for the interesting
> challenges.

I believe that's more or less where Dogacan's port is right now, except
it's not merged with the OSGI port.

> PS. Many of the interesting challenges in your proposal seem to fall into
> the category of "data analysis and manipulation" and are mostly used
> after the data has been crawled, or between the fetch cycles, so many of
> those could be implemented in the current code base as well. Somehow I just
> feel that things could be made more efficient and understandable if the
> foundation (e.g. data structures and extensibility) was in
> better shape. Also, if written nicely, other projects could use them too!

Definitely agree with this. Example: the PageRank package - it works
quite well with the current code, but its design is obscured by the
ScoringFilter API and the need to maintain its own extended DB-s.

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch near future - strategic directions

Sami Siren-2
Andrzej Bialecki wrote:
> Sami Siren wrote:
>> Lots of good thoughts and ideas, easy to agree with.
>>
>> Something for the "ease of use" category:
>> -allow running on top of plain vanilla hadoop
>
> What does "plain vanilla" mean here? Do you mean the current DB
> implementation? That's the idea - we should aim for an abstract layer
> that can accommodate both HBase and plain MapFile-s.

I was simply trying to say that we should no longer bundle Hadoop with
Nutch, and instead just mention the specific version it should run on top
of as a requirement. I am not totally sure anymore if this is a good idea...

I do not know the details of the HBase branch. Would using HBase allow us
easy migration from one data model to another (without the complex code we
now have in our datums)? How easy is HBase to manage/set up/configure?

I think Avro looks promising as a data storage technology: it has some
support for data model evolution, can be accessed "natively" from many
programming languages, and performs relatively well... The downside at
the moment is that it is not yet fully supported by hadoop mapred (I think).
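
As a rough illustration of the data model evolution aspect, an Avro
schema for a page record might look like the (entirely hypothetical)
example below; a field added later with a default value can still be
resolved against data written with the older schema:

{
  "type": "record",
  "name": "WebPage",
  "namespace": "org.example.crawl",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "fetchTime", "type": "long"},
    {"name": "content", "type": "bytes"},
    {"name": "signature", "type": ["null", "bytes"], "default": null}
  ]
}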

>> -split into reusable components with nice and clean public api
>> -publish mvn artifacts so developers can directly use mvn, ivy etc to
>> pull required dependencies for their specific crawler
>
> +1, with slight preference towards ivy.

I was not clear here; I think I was referring to users of Nutch instead
of developers. And in that case the choice of tool would be up to the
user, after the artifacts are in the repo.

Also, I think what I wanted to say is more about the model of how
people who want to do some customization would operate, rather than
about a technology choice.

Creating new plugin:
-create your own build configuration (or use a template we provide)
-implement plugin code
-publish to m2 repository

Creating your custom crawler:
-create your own build configuration (or use a template we might
provide), specify the dependencies you need (plugins basically, from
apache or from anybody else as long as they are available through some
repository)
-potentially write some custom code

We could also still provide a "default" Nutch crawler, as a build
configuration (basically just an xml file + some config), if we wanted.
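
As a minimal sketch of the "implement plugin code" step above, this is
roughly what a trivial URLFilter plugin looks like against the current
extension points (the plugin.xml descriptor and build wiring are
omitted, and the class itself is made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

/** Toy URLFilter: keep only http/https URLs and drop everything else. */
public class HttpOnlyUrlFilter implements URLFilter {

  private Configuration conf;

  // URLFilter contract: return the (possibly rewritten) URL to keep it,
  // or null to filter it out.
  public String filter(String urlString) {
    if (urlString.startsWith("http://") || urlString.startsWith("https://")) {
      return urlString;
    }
    return null;
  }

  // The extension point extends Configurable, which gives plugins access
  // to the Nutch/Hadoop configuration.
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return conf;
  }
}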

The new Hadoop maven artifacts also help with this vision, since we
could access hadoop apis (and dependencies) through a similar mechanism.

>> My biggest concern is the execution of this (or any other) plan.
>> Some of the changes or improvements that have been proposed are quite
>> "heavy" in nature and would require large changes. I am just wondering
>> whether it would be better to take a fresh start instead of trying
>> to do this incrementally on top of the existing code base.
>
> Well ... that's (almost) what Dogacan did with the HBase port. I agree
> that we should not feel too constrained by the existing code base, but
> it would be silly to throw everything away and start from scratch - we
> need to find a middle ground. The crawler-commons and Tika projects
> should help us to get rid of the ballast and significantly reduce the
> size of our code.

I am not aiming to throw everything away, just trying to relax the back
compatibility burden and give "innovation" a chance.

>> In the history of Nutch this approach is not something new (remember
>> map reduce?) and in my opinion it worked nicely then. Perhaps it is
>> different this time since the changes we are discussing now have many
>> abstract things hanging in the air, even fundamental ones.
>
> Nutch 0.7 to 0.8 reused a lot of the existing code.

I am hoping that this time it will not be different.

>>
>> Of course the rewrite approach means that it will take some time
>> before we actually get to the point where we can start adding real
>> substance (meaning new features etc).
>>
>> So to summarize, I would go ahead and put together a branch "nutch
>> N.0" that would consist of (a.k.a my wish list, hope I am not being
>> too aggressive here):
>>
>> -runs on top of plain hadoop
>
> See above - what do you mean by that?
>
>> -use osgi (or some other more optimal extension mechanism that fits
>> and is easy to use)
>> -basic http/https crawling functionality (with "db abstraction" or
>> hbase directly and smart data structures that allow flexible and
>> efficient usage of the data)
>> -basic solr integration for indexing/search
>> -basic parsing with tika
>>
>> After the basics are ok we would start adding and promoting any of the
>> hidden gems we might have, or some solutions for the interesting
>> challenges.
>
> I believe that's more or less where Dogacan's port is right now, except
> it's not merged with the OSGI port.

Are you sure OSGI is the way to go? I know it has all these nice
features and all, but for some reason I feel that we could live with
something simpler. From a functional point of view: just drop your jars into the
classpath and you're all set. So 2 changes here: 1. plugins are jars, 2.
no individual classloaders for plugins.

--
  Sami Siren