Re: Per-page crawling policy

Ken Krugler-3
Hi Andrzej,

>I've been toying with the following idea, which is an extension of
>the existing URLFilter mechanism and the concept of a "crawl
>frontier".
>
>Let's suppose we have several initial seed urls, each with a
>different subjective quality. We would like to crawl these, and
>expand the "crawling frontier" using outlinks. However, we don't
>want to do it uniformly for every initial url, but rather propagate
>certain "crawling policy" through the expanding trees of linked
>pages. This "crawling policy" could consist of url filters, scoring
>methods, etc - basically anything configurable in Nutch could be
>included in this "policy". Perhaps it could even be the new version
>of non-static NutchConf ;-)
>
>Then, if a given initial url is a known high-quality source, we
>would like to apply a "favor" policy, where we e.g. add pages linked
>from that url, and in doing so we give them a higher score.
>Recursively, we could apply the same policy for the next generation
>pages, or perhaps only for pages belonging to the same domain. So,
>in a sense the original notion of high-quality would cascade down to
>other linked pages. The important aspect of this to note is that all
>newly discovered pages would be subject to the same policy - unless
>we have compelling reasons to switch the policy (from "favor" to
>"default" or to "distrust"), which at that point would essentially
>change the shape of the expanding frontier.
>
>If a given initial url is a known spammer, we would like to apply a
>"distrust" policy for adding pages linked from that url (e.g. adding
>or not adding, if adding then lowering their score, or applying
>different score calculation). And recursively we could apply a
>similar policy of "distrust" to any pages discovered this way. We
>could also change the policy on the way, if there are compelling
>reasons to do so. This means that we could follow some high-quality
>links from low-quality pages, without drilling down the sites which
>are known to be of low quality.
>
>Special care needs to be taken if the same page is discovered from
>pages with different policies, I haven't thought about this aspect
>yet... ;-)

This sounds like the TrustRank algorithm. See
http://www.vldb.org/conf/2004/RS15P3.PDF. This talks about trust
attenuation via trust dampening (reducing the trust level as you get
further from a trusted page) and trust splitting (OPIC-like approach).
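
In code terms the two attenuation schemes boil down to something like
this (an illustrative sketch with made-up names, not anything that
exists in Nutch):

public class TrustAttenuation {

  // Trust dampening: each hop away from a trusted page keeps only a
  // constant fraction (e.g. 0.85) of the parent's trust.
  static float dampenedTrust(float parentTrust, float dampingFactor) {
    return parentTrust * dampingFactor;
  }

  // Trust splitting (OPIC-like): the parent's trust is divided evenly
  // among its outlinks, so heavily-linking pages pass on less per link.
  static float splitTrust(float parentTrust, int numOutlinks) {
    return numOutlinks > 0 ? parentTrust / numOutlinks : 0.0f;
  }
}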

>What would be the benefits of such approach?
>
>* the initial page + policy would both control the expanding
>crawling frontier, and it could be differently defined for different
>starting pages. I.e. in a single web database we could keep
>different "collections" or "areas of interest" with differently
>specified policies. But still we could reap the benefits of a single
>web db, namely the link information.
>
>* URLFilters could be grouped into several policies, and it would be
>easy to switch between them, or edit them.
>
>* if the crawl process realizes it ended up on a spam page, it can
>switch the page policy to "distrust", or the other way around, and
>stop crawling unwanted content. From now on the pages linked from
>that page will follow the new policy. In other words, if a crawling
>frontier reaches pages with known quality problems, it would be easy
>to change the policy on-the-fly to avoid them or pages linked from
>them, without resorting to modifications of URLFilters.
>
>Some of the above you can do even now with URLFilters, but any
>change you do now has global consequences. You may also end up with
>awfully complicated rules if you try to cover all cases in one rule
>set.

The approach we took (with Nutch 0.7) is to use the Page nextScore
field as a 'crawl priority' field. We apply a scoring function to
each page that takes into account the contents of the page, the
page's domain (we have a hand-selected set of known "good" domains
and sub-domains), and the page's OPIC score. This gets divided up
among the valid outlinks (after passing them through the URL
filter), and summed into the appropriate Page.nextScore entries in
the web db.

Then at crawl time we sort by nextScore, and pick a percentage of the
total unfetched links.

This gives us a pretty good depth-first crawl, where "depth" is
defined by the page content scoring function and the set of trusted
domains.
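
Roughly, the scoring and distribution step looks like this - a
simplified sketch with made-up names (pageScore, distributeToOutlinks,
the domain boost weight), not our actual 0.7 code:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CrawlPriority {

  // Combine the page content score, trusted-domain membership and the
  // page's OPIC score into a single static priority. The domain boost
  // weight here is a made-up placeholder.
  static float pageScore(float contentScore, boolean trustedDomain,
                         float opicScore) {
    float domainBoost = trustedDomain ? 2.0f : 1.0f;
    return contentScore * domainBoost + opicScore;
  }

  // Divide the page's score among its valid (filter-passing) outlinks
  // and sum each share into the target page's nextScore.
  static void distributeToOutlinks(float score, List<String> validOutlinks,
                                   Map<String, Float> nextScore) {
    if (validOutlinks.isEmpty()) {
      return;
    }
    float share = score / validOutlinks.size();
    for (String url : validOutlinks) {
      Float prev = nextScore.get(url);
      nextScore.put(url, prev == null ? share : prev + share);
    }
  }
}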

The four issues we've run into with this approach are:

1. Even with a pretty broad area of interest, you wind up focusing on
a subset of all domains. Which then means that the max threads per
host limit (for polite crawling) starts killing your efficiency.

To work around this, we've modified Nutch to order the fetch list
URLs by domain, and constrain the max # of URLs per domain based on
the total number of URLs to be fetched, and the number of threads
we're using.
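
The cap itself is simple - something along these lines (illustrative
only; getDomain() is a stand-in for however you extract the host, and
the cap formula is just one way to derive it):

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DomainCappedFetchList {

  // Build a fetch list from score-ordered candidates, allowing each
  // domain only a share of the round so the politeness limit doesn't
  // leave most fetcher threads idle.
  static List<String> build(List<String> candidatesByScore,
                            int totalToFetch, int numThreads) {
    int maxPerDomain = Math.max(1, totalToFetch / numThreads);
    Map<String, Integer> perDomain = new HashMap<String, Integer>();
    List<String> fetchList = new ArrayList<String>();
    for (String url : candidatesByScore) {
      if (fetchList.size() >= totalToFetch) {
        break;
      }
      String domain = getDomain(url);
      Integer seen = perDomain.get(domain);
      int count = (seen == null) ? 0 : seen.intValue();
      if (count < maxPerDomain) {
        perDomain.put(domain, count + 1);
        fetchList.add(url);
      }
    }
    return fetchList;
  }

  // Stand-in helper: use the host part of the URL as the "domain".
  static String getDomain(String url) {
    try {
      return new URL(url).getHost();
    } catch (MalformedURLException e) {
      return url;
    }
  }
}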

The next step is to turn on HTTP keep-alive, which would be pretty
easy other than some funky issues we've run into with the http
connection pool, and the fact that the current protocol plug-in API
doesn't give us a good channel to pass back the info that the fetch
thread needs to control the keep alive process.

2. With a vertical crawl, you seem to wind up at "Big Bob's Server"
sooner/more often than with a breadth first crawl. Going wide means
you spend a lot more time on sites with lots of pages, and these are
typically higher performance/better behaved. With vertical, we seem
to hit a lot more slow/nonconforming servers, which then kill our
fetch performance.

Typical issues are things like servers sending back an endless stream
of HTTP response headers, or trickling data back for a big file.

To work around this, we've implemented support for monitoring thread
performance and interrupting threads that are taking too long.
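
The watchdog is nothing fancy - conceptually it's along these lines
(a sketch with made-up names, not our actual patch):

public class FetchWatchdog extends Thread {

  private final Thread[] fetchers;
  private final long[] fetchStartTimes;  // updated by each fetcher per URL
  private final long maxFetchTimeMs;

  public FetchWatchdog(Thread[] fetchers, long[] fetchStartTimes,
                       long maxFetchTimeMs) {
    this.fetchers = fetchers;
    this.fetchStartTimes = fetchStartTimes;
    this.maxFetchTimeMs = maxFetchTimeMs;
    setDaemon(true);
  }

  public void run() {
    while (true) {
      long now = System.currentTimeMillis();
      for (int i = 0; i < fetchers.length; i++) {
        // If a thread has been stuck on the same fetch for too long
        // (slow server, endless headers, trickled data), interrupt it.
        if (fetchers[i].isAlive()
            && now - fetchStartTimes[i] > maxFetchTimeMs) {
          fetchers[i].interrupt();
        }
      }
      try {
        Thread.sleep(1000);
      } catch (InterruptedException e) {
        return;
      }
    }
  }
}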

3. With a vertical crawl, you typically want to keep the number of
fetched URLs (per loop) at a pretty low percentage relative to the
total number of unfetched URLs in the WebDB. This helps with
maintaining good crawl focus.

The problem we've run into is that the percentage of time spent
updating the WebDB, once you're past about 10M pages, starts to
dominate the total crawl time.

For example, we can fetch 200K pages in an hour on one machine, but
then it takes us 2.5 hours to update a 15M page WebDB with the
results. We can do a fetch in parallel with the update from the
previous crawl, but that doesn't buy us much. The typical solution is
to increase the number of URLs fetched, to balance out the fetch vs.
update time, but this winds up wasting bandwidth when most of the
extra URLs you fetch aren't all that interesting.

We're hoping that switching to Nutch 0.8 will help solve this issue.
We're in the middle of integrating our mods from 0.7 - don't know how
hard this will be.

4. We'd like to save multiple scores for a page, for the case where
we're applying multiple scoring functions to a page. But modifying
the WebDB to support a set of scores per page, while not hurting
performance, seems tricky.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200

Re: Per-page crawling policy

Andrzej Białecki-2
Hi Ken,

First of all, thanks for sharing your insights, that's a very
interesting read.

Ken Krugler wrote:
> This sounds like the TrustRank algorithm. See
> http://www.vldb.org/conf/2004/RS15P3.PDF. This talks about trust
> attenuation via trust dampening (reducing the trust level as you get
> further from a trusted page) and trust splitting (OPIC-like approach).

Yes, I'm familiar with that paper - that's what got me thinking... ;)

> 1. Even with a pretty broad area of interest, you wind up focusing on
> a subset of all domains. Which then means that the max threads per
> host limit (for polite crawling) starts killing your efficiency.

The "policies" approach that I described is able to follow and
distribute the scores along any links, not necessarily within the
domain, so I think we could avoid this.

> The next step is to turn on HTTP keep-alive, which would be pretty
> easy other than some funky issues we've run into with the http
> connection pool, and the fact that the current protocol plug-in API
> doesn't give us a good channel to pass back the info that the fetch
> thread needs to control the keep alive process.

I'm afraid this is still the case with 0.8; given the way protocol
plugins interact with the Fetcher and the fetchlist, it's not easy to
improve.

>
> 2. With a vertical crawl, you seem to wind up at "Big Bob's Server"
> sooner/more often than with a breadth first crawl. Going wide means
> you spend a lot more time on sites with lots of pages, and these are
> typically higher performance/better behaved. With vertical, we seem to
> hit a lot more slow/nonconforming servers, which then kill our fetch
> performance.
>
> Typical issues are things like servers sending back an endless stream
> of HTTP response headers, or trickling data back for a big file.
>
> To work around this, we've implemented support for monitoring thread
> performance and interrupting threads that are taking too long.

I agree, this is something that needs to be added, especially to HTTP
protocol plugins. The code for HTTP plugins has been refactored, so I
hope it will be easier to implement this monitoring in 0.8 ...

>
> 3. With a vertical crawl, you typically want to keep the number of
> fetched URLs (per loop) at a pretty low percentage relative to the
> total number of unfetched URLs in the WebDB. This helps with
> maintaining good crawl focus.
>
> The problem we've run into is that the percentage of time spent
> updating the WebDB, once you're past about 10M pages, starts to
> dominate the total crawl time.

I know what you mean; I went through it with a 20 million page
installation and it wasn't pleasant. This is definitely not the case
with the 0.8 CrawlDB
(which replaces the WebDB). All update times seem to be linearly
proportional to the number of entries, and much shorter overall. So,
even with a large DB and large updates the updatedb command is pretty fast.

> 4. We'd like to save multiple scores for a page, for the case where
> we're applying multiple scoring functions to a page. But modifying the
> WebDB to support a set of scores per page, while not hurting
> performance, seems tricky.

Again, this will be different in 0.8 - we are going to add support for
arbitrary metadata to CrawlDB entries (formerly known as Page, now
CrawlDatum). You will be able to add multiple scores just fine, at a
relatively small cost in performance for generate/update operations -
it's not possible to avoid at least some performance loss, since adding
more data always means you need to process it... but overall these
operations scale much better in 0.8 than before.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Per-page crawling policy

kkrugler
Hi Andrzej,

>>1. Even with a pretty broad area of interest, you wind up focusing
>>on a subset of all domains. Which then means that the max threads
>>per host limit (for polite crawling) starts killing your efficiency.
>
>The "policies" approach that I described is able to follow and
>distribute the scores along any links, not necessarily within the
>domain, so I think we could avoid this.

The issue here is that we score pages statically - so pages don't all
start with a score of 1.0f, but rather with a score somewhere in the
0.0...1.0f range.

Then we divide this score among the (valid) outlinks, and sum these
contributions into the OPIC scores for the referenced pages.

When we use these scores to rank URLs to fetch, and constrain (via
topN) the number of URLs fetched each round (to help focus the
search), we wind up with what looks like an exponential distribution
of URLs per domain - the top few domains wind up with most of the URLs.

We modified Nutch 0.7 to restrict (as a percentage of the total URLs
being fetched) the number of URLs coming from any one domain. I see
that 0.8 has something similar, but it's a max-URLs-per-domain limit,
whereas I think a percentage makes more sense.
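
The percentage version is just a different cap calculation, something
like this (sketch only, made-up names):

public class DomainFraction {
  // Cap each domain at a fixed fraction of the round: e.g. with
  // maxFraction = 0.05f no single domain gets more than 5% of the
  // URLs fetched this round.
  static int maxUrlsPerDomain(int totalToFetch, float maxFraction) {
    return Math.max(1, Math.round(totalToFetch * maxFraction));
  }
}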

We also had to add "kill the fetch" support. We monitor the fetch
threads, and when the ratio of active (fetching) threads to inactive
(blocked) threads drops below a threshold, we terminate the fetch.
This compensates for cases where a popular site is also a
low-performing site.
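
The check itself is trivial - something along these lines
(illustrative, not our actual code; the threshold is made up):

public class FetchKillSwitch {
  // Terminate the fetch once too few threads are actively fetching
  // relative to those blocked on slow servers.
  static boolean shouldKillFetch(int activeThreads, int blockedThreads,
                                 float minActiveToBlockedRatio) {
    if (blockedThreads == 0) {
      return false;                       // nothing is stuck yet
    }
    return ((float) activeThreads / blockedThreads) < minActiveToBlockedRatio;
  }
}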

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200