Optimizing which links to fetch

kkrugler
Hi all,

It seems that the default behavior of Nutch when sorting links to
fetch is to use scoreByLinkCount. This then sets the fetch score for
links on a page to be the same as the containing page's "in-bound
link" score (or actually the log of same).
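
To make the behavior described above concrete, here is a minimal sketch of "every link on a page inherits the page's score". It is not Nutch's code: the class and method names are hypothetical, and the exact formula (natural log of 1 + in-bound link count) is an assumption made only so the example is well defined for pages with no in-bound links.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of Nutch.
public class InheritedLinkScore {

    /**
     * Assumed page score: natural log of (1 + in-bound link count).
     * The +1 is an assumption of this sketch, added to avoid log(0).
     */
    public static double pageScore(int inboundLinks) {
        return Math.log(1.0 + inboundLinks);
    }

    /** Every out-link on the page receives the same score as the containing page. */
    public static double[] outlinkScores(int inboundLinks, int outlinkCount) {
        double[] scores = new double[outlinkCount];
        Arrays.fill(scores, pageScore(inboundLinks));
        return scores;
    }
}
```

The point of the sketch is that nothing about an individual link affects its score; only the containing page matters.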

What I'd like to do is rate each link on a page separately, based on
its proximity to key words and other calculated hot-spots. Has this
been done before? Is the support already there, and I haven't found
it yet?
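
As a rough illustration of the per-link rating asked about above, the following sketch scores a link by its character distance to the nearest keyword occurrence in the page text. Everything here is an assumption for illustration: `LinkProximityScorer`, `PageLink`, the `window` parameter, and the decay formula are hypothetical and do not correspond to any existing Nutch API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; not part of Nutch.
public class LinkProximityScorer {

    /** A link on a page: target URL plus the character offset of its anchor in the page text. */
    public static final class PageLink {
        public final String url;
        public final int offset;

        public PageLink(String url, int offset) {
            this.url = url;
            this.offset = offset;
        }
    }

    /** Returns the start offset of every occurrence of any keyword (case-insensitive match). */
    public static List<Integer> keywordOffsets(String text, List<String> keywords) {
        String lower = text.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (String kw : keywords) {
            String k = kw.toLowerCase(Locale.ROOT);
            if (k.isEmpty()) {
                continue; // an empty keyword would match at every position
            }
            for (int i = lower.indexOf(k); i >= 0; i = lower.indexOf(k, i + 1)) {
                hits.add(i);
            }
        }
        return hits;
    }

    /**
     * Returns 1.0 when a keyword starts exactly at the link offset, decaying
     * toward 0 as the nearest keyword gets farther away. Returns 0.0 when the
     * page contains no keyword at all. The window controls how fast the score
     * decays: at a distance of one window the score is 0.5.
     */
    public static double score(PageLink link, List<Integer> hits, double window) {
        if (hits.isEmpty()) {
            return 0.0;
        }
        int nearest = Integer.MAX_VALUE;
        for (int h : hits) {
            nearest = Math.min(nearest, Math.abs(h - link.offset));
        }
        return 1.0 / (1.0 + nearest / window);
    }
}
```

A per-link score like this could then be combined with, or substituted for, the page-level score when ordering the fetch list; where that hook would live in Nutch is exactly the open question.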

If I need to do it myself, the most straightforward approach would be
to modify emitFetchList() to parse each page (from webdb.pages()),
matching up the anchors with what's returned by
dbAnchors.getanchors(). But this seems inefficient and awkward. Would
it be better to do this analysis when parsing the HTML originally,
and somehow save each anchor's score in the web DB?

Thanks,

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

Re: Optimizing which links to fetch

Doug Cutting-2
Ken Krugler wrote:
> It seems that the default behavior of Nutch when sorting links to fetch
> is to use scoreByLinkCount. This then sets the fetch score for links on
> a page to be the same as the containing page's "in-bound link" score (or
> actually the log of same).

Please also see:

http://issues.apache.org/jira/browse/NUTCH-61

This is an extensible mechanism for altering the fetch schedule.
Similarly, we need an extensible mechanism for computing page scores,
which are used to prioritize the fetching of scheduled pages.  Note that
the scoring mechanism has changed substantially in the development trunk
from what is in the 0.7 release.

Doug