Benchmark: max fetcher speed

Benchmark: max fetcher speed

Andrzej Bialecki-2
Hi,

Here's a status line from a Benchmark job that I ran recently:

0/0 threads spinwaiting 38996 pages, 1 errors, 557.1 pages/s, 16995 kb/s, 0 URLs in 2 queues > reduce
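
(That works out to 16995 / 557.1, or roughly 30 kb per page on average.)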

Interested in more details? :) I thought so...

* first, this is a synthetic benchmark. Target pages were produced on
the fly by the 'ant proxy' fake handler, with the unlimited bandwidth of
a localhost connection (the proxy was running on localhost). Fake pages
were generated in a way that guaranteed that all pages and all hosts
were unique, i.e. there were no outlinks to the same hosts or to the
same pages across the whole run (see the sketch after this list).

* fetcher.parse=false, i.e. Fetcher only stored the content. There were
100 threads running.

* there was no DNS resolution - all pages were served by the proxy, so
Nutch didn't need to resolve names to IPs.
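
For illustration, here is a minimal sketch of the uniqueness scheme the
fake pages follow - this is made-up example code, not the actual 'ant
proxy' handler (the class and host names are invented):

import java.util.concurrent.atomic.AtomicLong;

public class FakePageGenerator {
  // Global counter; every outlink ever emitted gets a fresh id.
  private final AtomicLong counter = new AtomicLong();

  // Returns the HTML body of one fake page with 'outlinks' unique links.
  public String generatePage(int outlinks) {
    StringBuilder html = new StringBuilder("<html><body>\n");
    for (int i = 0; i < outlinks; i++) {
      long id = counter.getAndIncrement();
      // Each link targets a never-before-seen host *and* page, so no
      // two fetches ever hit the same host and per-host politeness
      // delays never apply.
      html.append("<a href=\"http://host-").append(id)
          .append(".example/page-").append(id).append(".html\">link</a>\n");
    }
    return html.append("</body></html>").toString();
  }
}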

What do these numbers mean? Well, they mean that if there are no pesky
obstacles like host-level blocking, DNS resolution or bandwidth limits,
then the Fetcher is insanely fast. ;)

That's good to know, actually - I was afraid that there was some
inherent limitation in the Fetcher, due to synchronization, that
prevented it from working faster than ~100 pages/sec (per task).
Apparently that's not the case.
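
To make the worry concrete, here's a toy demonstration (not Fetcher
code) of how a single serialized critical section caps aggregate
throughput: if every page had to pass through a lock held for ~10 ms,
100 threads together could never exceed ~100 pages/sec:

import java.util.concurrent.atomic.AtomicLong;

public class LockThroughputDemo {
  private static final Object LOCK = new Object();

  public static void main(String[] args) throws InterruptedException {
    final long end = System.currentTimeMillis() + 5000; // run ~5 seconds
    final AtomicLong pages = new AtomicLong();
    Thread[] threads = new Thread[100];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(() -> {
        while (System.currentTimeMillis() < end) {
          synchronized (LOCK) {
            // Stand-in for any per-page work done under a shared lock.
            try { Thread.sleep(10); } catch (InterruptedException e) { return; }
          }
          pages.incrementAndGet();
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) t.join();
    // Prints ~100 regardless of the thread count, because the lock
    // serializes the 10 ms sections: 1000 ms / 10 ms = 100 pages/sec.
    System.out.println("pages/sec ~ " + pages.get() / 5);
  }
}

The measured 557 pages/s shows there is no such global serialization
point in the real Fetcher.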

I'll try to set up other benchmarks that introduce some of the above
factors, to see which of them has an undue impact on performance.

Oh, btw - the above benchmark was run on a 1-node Hadoop cluster with
an HBase backend, with the following results:

10/08/13 15:52:41 INFO crawl.WebTableReader: Statistics for WebTable:
10/08/13 15:52:41 INFO crawl.WebTableReader: TOTAL urls:        2588551
10/08/13 15:52:41 INFO crawl.WebTableReader: retry 0:   2588534
10/08/13 15:52:41 INFO crawl.WebTableReader: retry 1:   17
10/08/13 15:52:41 INFO crawl.WebTableReader: min score: 0.0
10/08/13 15:52:41 INFO crawl.WebTableReader: avg score: 1.2037623E-6
10/08/13 15:52:41 INFO crawl.WebTableReader: max score: 1.116
10/08/13 15:52:41 INFO crawl.WebTableReader: status 1 (status_unfetched):       2415981
10/08/13 15:52:41 INFO crawl.WebTableReader: status 2 (status_fetched): 172570
10/08/13 15:52:41 INFO crawl.WebTableReader: WebTable statistics: done
* Plugins:
protocol-http|parse-tika|scoring-opic|urlfilter-regex|urlnormalizer-pass
* Seeds:        1
* Depth:        6
* Threads:      100
* TopN: 9223372036854775807 (Long.MAX_VALUE, i.e. unlimited)
* TOTAL ELAPSED:        4362745
- stage: inject
         run 0   23910
- stage: generate
         run 0   24187
         run 1   23531
         run 2   24234
         run 3   27095
         run 4   36100
         run 5   129736
- stage: fetch
         run 0   51506
         run 1   60231
         run 2   72187
         run 3   90323
         run 4   383787
         run 5   516860
- stage: parse
         run 0   12205
         run 1   12125
         run 2   15160
         run 3   30101
         run 4   198388
         run 5   850305
- stage: update
         run 0   24297
         run 1   24142
         run 2   24117
         run 3   36127
         run 4   106444
         run 5   1565639


Injection is nearly a no-op (I use a single seed URL), so its run gives
us the basic Hadoop overhead of ~24 seconds per job (the stage times
above are in milliseconds). With 25 job runs in this crawl - 1 inject
plus 6 each of generate, fetch, parse and update - that startup
overhead alone accounts for roughly 25 x 24 s = 10 minutes of the
~73-minute total.

Please note that unlike the previous benchmark results, this one uses
depth 6 - for unknown reasons, even at this depth the number of
collected URLs is _higher_ than in a depth-7 run on branch-1.3 ...
apparently there's something weird going on with URL accounting in
trunk...

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Benchmark: max fetcher speed

Mattmann, Chris A (3010)
Haha, awesome, +1 for being fast!


On 8/13/10 8:27 AM, "Andrzej Bialecki" <ab@...> wrote:

[...]




++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@...
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++