benchmarking


benchmarking

Edward Quick

Hi,

Has anyone tried benchmarking Nutch? I just wondered how long I should expect the different stages of a Nutch crawl to take.

For example, I'm running Nutch on a RHEL4 machine with four Intel 2 GHz CPUs and 4 GB of RAM. This is my Nutch fetch process:

/usr/jdk1.5.0_10/bin/java -Xmx2000m -Dhadoop.log.dir=/nutch/search/logs -Dhadoop.log.file=hadoop.log -Djava.library.path=/nutch/search/lib/native/Linux-i386-32 -Dhadoop.tmp.dir=/nutch/tmp -Djava.io.tmpdir=/nutch/tmp -classpath /nutch/search:/nutch/search/conf:/usr/jdk1.5.0_10/lib/tools.jar:/nutch/search/build:/nutch/search/build/test/classes:/nutch/search/build/nutch-1.0-dev.job:/nutch/search/nutch-*.job:/nutch/search/lib/commons-cli-2.0-SNAPSHOT.jar:/nutch/search/lib/commons-codec-1.3.jar:/nutch/search/lib/commons-httpclient-3.0.1.jar:/nutch/search/lib/commons-lang-2.1.jar:/nutch/search/lib/commons-logging-1.0.4.jar:/nutch/search/lib/commons-logging-api-1.0.4.jar:/nutch/search/lib/hadoop-0.17.1-core.jar:/nutch/search/lib/icu4j-3_6.jar:/nutch/search/lib/jakarta-oro-2.0.7.jar:/nutch/search/lib/jets3t-0.5.0.jar:/nutch/search/lib/jetty-5.1.4.jar:/nutch/search/lib/junit-3.8.1.jar:/nutch/search/lib/log4j-1.2.13.jar:/nutch/search/lib/lucene-core-2.3.0.jar:/nutch/search/lib/lucene-misc-2.3.0.jar:/nutch/search/lib/servlet-api.jar:/nutch/search/lib/taglibs-i18n.jar:/nutch/search/lib/tika-0.1-incubating.jar:/nutch/search/lib/xerces-2_6_2-apis.jar:/nutch/search/lib/xerces-2_6_2.jar:/nutch/search/lib/jetty-ext/ant.jar:/nutch/search/lib/jetty-ext/commons-el.jar:/nutch/search/lib/jetty-ext/jasper-compiler.jar:/nutch/search/lib/jetty-ext/jasper-runtime.jar:/nutch/search/lib/jetty-ext/jsp-api.jar org.apache.nutch.fetcher.Fetcher crawl/segments/20080923105853

and a fetch of about 100,000 pages (with 20 threads per host) takes around 1-2 hours. Does that seem reasonable or too slow?

Thanks for any help.

Ed.

Re: benchmarking

Kevin MacDonald-3
Edward,
I have been doing crawl operations, as opposed to the fetch operation you're running below, and I am a little unclear on the difference. Since you're specifying a segment path when doing a fetch, does that mean you have already crawled? If we can break out the operations each of us is doing end to end, perhaps we can get an apples-to-apples performance comparison. What I am doing is crawling a list of perhaps 10,000 URLs to a depth of 1 only; most are from different hosts. I am finding there are two main blocks of computation time when I crawl: the fetching, which seems to happen quite fast, followed by a lengthy process during which the machine's CPU sits at 100%, though I'm not sure what it's doing. Perhaps it's parsing at that point? Can you tell me what your operations are and what your configuration is?

Kevin


Re: benchmarking

Kevin MacDonald-3
Some additional info: we are running Nutch on one of Amazon's EC2 small instances, which has CPU capacity equivalent to a 1.0-1.2 GHz 2007 Opteron or Xeon processor, with 1.7 GB of RAM.
To crawl 35,000 URLs to a depth of 1, here's a breakdown of processing times from the log file:
 - Injecting URLs into the crawl db: 1 min
 - Fetching: 46 min
 - Additional processing (unknown): 66 min

Fetching is happening at a rate of about 760 URLs/min, or 1.1 million per day. The big block of additional processing happens after the last fetch. I don't really know what Nutch is doing during that time; parsing, perhaps? I would really like to know, because it is killing my performance.

Kevin


Re: benchmarking

Doğacan Güney-3

Are you using the "crawl" command? If you are serious about Nutch, I would suggest using the individual commands (inject/fetch/parse/etc.) instead; that should give you a better idea of what is taking so long.
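
For reference, a minimal sketch of one cycle broken into those individual steps (the crawl/crawldb, crawl/segments and urls paths are placeholders for your own layout):

# inject seed URLs into the crawl db
bin/nutch inject crawl/crawldb urls
# generate a fetch list as a new segment
bin/nutch generate crawl/crawldb crawl/segments
# pick up the newest segment directory
s=`ls -d crawl/segments/* | tail -1`
# fetch it (parses inline if fetcher.parse=true; otherwise run "bin/nutch parse $s" afterwards)
bin/nutch fetch $s
# fold the fetch results back into the crawl db
bin/nutch updatedb crawl/crawldb $s

Timing each command separately makes it obvious which stage dominates.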


Re: benchmarking

Kevin MacDonald-3
I am using individual commands called from Java; I simply used Crawl.java as a starting point. Because I'm not using Nutch for search, I have already eliminated quite a few things, such as building indexes and inverting links. All the fetching and the subsequent lengthy operations happen in Fetcher.fetch(segments, threads). My next step is to dig into that and see what's going on. When I crank up the logging I see a ton of map/reduce activity.
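
A cruder first cut than instrumenting Fetcher.fetch() would be to drive the same steps through the bin/nutch wrapper and put time in front of each; a sketch, with the segment path and crawl db location as placeholders:

time bin/nutch fetch crawl/segments/20080923105853
time bin/nutch updatedb crawl/crawldb crawl/segments/20080923105853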


RE: benchmarking

Edward Quick

Hi Kevin,

Thanks for your reply.

I haven't checked the other crawl stages, as the bottleneck for me is the Nutch fetch (mine is configured to parse and store content). Also, I'm fetching from an F5 load-balanced pair of Lotus Domino intranet servers, so network speed is not a factor.

The actual time spent fetching/parsing URLs seems quick, around 1,000 a minute with 20 threads, but there is some processing after that which seems to grow much faster than linearly as the list gets bigger. I timed some of the fetches, and you can see here how much the time increases after I get to 20,000 URLs:

bin/nutch readseg -list -dir crawl/segments/
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED  *ACTUAL TIME ELAPSED*
20080924121754  1          2008-09-24T12:18:04  2008-09-24T12:18:04  1        1
20080924121816  58         2008-09-24T12:18:47  2008-09-24T12:18:53  58       25
20080924121920  363        2008-09-24T12:19:40  2008-09-24T12:20:02  379      214
20080924122034  1987       2008-09-24T12:21:14  2008-09-24T12:22:50  2085     1434
20080924122344  4494       2008-09-24T12:24:25  2008-09-24T12:31:14  4598     4318
20080924125007  6802       2008-09-24T12:51:02  2008-09-24T13:01:32  6874     6462    14:43.16
20080924131404  8170       2008-09-24T13:15:04  2008-09-24T13:22:29  8317     7802    13:26.59
20080924132912  21065      2008-09-24T13:30:28  2008-09-24T13:42:49  21081    19699   43:42.11
20080924141603  26205      2008-09-24T14:18:42  2008-09-24T14:39:26  26327    24649   1:50:29
20080924161725  10998      ?                    ?                    ?        ?


Can anyone tell me what is going on in the fetch stage after the URLs have been fetched and parsed, please? Can it be sped up in any way?
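
One way to separate the two costs, as a sketch: set fetcher.parse to false in nutch-site.xml so that fetching and parsing run as separate jobs, then time each one (segment name taken from the table above):

time bin/nutch fetch crawl/segments/20080924141603
time bin/nutch parse crawl/segments/20080924141603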

Thanks,

Ed.

RE: benchmarking

kevin chen-6

Are you fetching from the same host? If the URL list is concentrated in a few hosts then, because of the politeness settings, a lot of the time will be spent waiting.
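
To put rough numbers on it: with the default fetcher.server.delay of 5 seconds and one request at a time per host, a fetch list of 20,000 URLs on a single host would take at least 20,000 x 5 s, or roughly 28 hours, so for single-host crawls the per-host delay and thread settings dominate.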


RE: benchmarking

Edward Quick

> Are you fetching from the same host? If the URL list is concentrated in
> a few hosts then, because of the politeness settings, a lot of the time
> will be spent waiting.

Yes, I am fetching from the same host. These are my nutch-site settings, which should hopefully override the politeness defaults:

fetcher.server.delay 0.01
fetcher.threads.fetch 10
fetcher.threads.per.host 50
fetcher.store.content true
fetcher.parse true
db.ignore.internal.links false
db.ignore.external.links true
db.max.outlinks.per.page -1
file.content.limit -1
http.content.limit -1
http.useHttp11 true
http.redirect.max 5
http.timeout 10000
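
Each of these overrides sits in conf/nutch-site.xml as a standard Hadoop-style property entry; as a sketch, the first two look like:

<configuration>
  <property>
    <name>fetcher.server.delay</name>
    <value>0.01</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>50</value>
  </property>
</configuration>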


