Speeding things up!


Speeding things up!

ocramp
Hi,

 Do you have some hints that would improve speed for the following nutch
commands?

 ./nutch generate db segments -topN 10000000
s=`ls -d segments/2* | tail -1`
./nutch fetch $s
./nutch updatedb db $s
./nutch index $s
./nutch dedup segments tmpfile

 I mean, do you have some hints for the numbers set in
nutch-default.xml, for example:
fetcher.threads (I'm using 10,000), etc.?
 Let's say it is running on a machine with 12 GB of RAM and a 2,000 GB HD.

Thank you very much for any help.

Marco

Re: Speeding things up!

Sami Siren-2
Some simple rules for generally speeding things up

1. Crawl only the content you are going to handle (for example, do not
fetch PDF files if you don't need them, and disable all unneeded
parsers)

2. If you are using regex-urlfilter and don't need the rule
"-.*(/.+?)/.*?\1/.*?\1/", remove it (backreference rules are expensive
to evaluate; also keep the number of rules as small as possible, still
remembering #1 and #3)

3. Check your parser configuration (see NUTCH-362) so your CPU won't end
up parsing all kinds of binary content with the text parser.
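
A quick illustration (my own sketch, not part of this thread) of what the
rule in #2 does and why dropping it helps: the backreferences force the
regex engine to backtrack heavily, and the pattern exists only to reject
crawler-trap URLs whose path repeats the same segment:

```python
import re

# The crawler-trap rule from regex-urlfilter.txt. In the filter file the
# leading "-" means "exclude URLs matching this pattern" and is not part
# of the regex itself.
trap = re.compile(r".*(/.+?)/.*?\1/.*?\1/")

# The backreferences match URLs whose path repeats the same segment three
# times, each followed by "/" -- a typical symptom of a crawler trap.
print(trap.search("http://example.com/a/b/a/c/a/d/") is not None)  # True: "/a" repeats
print(trap.search("http://example.com/a/b/c/") is not None)        # False: no repeats
```

If you never hit such traps, every URL still pays the cost of this check,
which is why removing unneeded rules speeds up generation and filtering.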

You might also check variables like "fetcher.server.delay" and
"fetcher.threads.per.host" (and remember to keep your fetcher polite!).
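
Such overrides belong in conf/nutch-site.xml rather than in
nutch-default.xml itself. The fragment below is an illustrative sketch,
not values recommended in this thread; it assumes the Hadoop-style
<configuration> root used by Nutch 0.8's conf files, and uses
"fetcher.threads.fetch", the full name of the property the thread
abbreviates as "fetcher.threads":

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: overrides for nutch-default.xml.
     Values are illustrative guesses, not recommendations. -->
<configuration>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>300</value> <!-- total fetcher threads -->
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>   <!-- stay polite: one connection per host -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value> <!-- seconds between requests to the same host -->
  </property>
</configuration>
```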

I am using something like 300 for "fetcher.threads" when fetching with
0.8.1 on a single Athlon 64 with 1 GB of memory.

I am also in the process of fixing some IO-related bottlenecks and will
get back to that hopefully sooner rather than later.

--
  Sami Siren






Re: Speeding things up!

Sami Siren-2
forgot one important one:

set "generate.max.per.host" to something reasonable so you won't end up
fetching URLs from only a small number of hosts, which is very slow (the
per-host politeness delay serializes those fetches).
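
As a nutch-site.xml fragment (again my own sketch; the value 100 is an
illustrative guess, not a recommendation from this thread):

```xml
<property>
  <name>generate.max.per.host</name>
  <!-- Cap on URLs selected per host for each generated segment.
       100 is illustrative; pick a value that fits your crawl. -->
  <value>100</value>
</property>
```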

--
  Sami Siren
