Can nutch pause, stop and start where it left off?

clusterboy
I am just starting to learn Nutch. One thing I want to know: can Nutch
pause, stop, and resume indexing a site on an incremental daily basis?
My concern is that Nutch will behave like a hog, crawling everything with
huge bandwidth consumption and pissing off many site owners.

Can some experts shed some light on this?

Re: Can nutch pause, stop and start where it left off?

Jesse Hires
Use the -topN flag to grab only a small number of URLs per run.
Also, I believe there is a setting you can put in nutch-site.xml that
slows down how many URLs you grab over time.
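
To illustrate the idea behind limiting each run to a small batch, here is a minimal Python sketch. It is not Nutch code: `run_round`, the JSON state file, and its `pending`/`fetched` keys are all hypothetical names made up for this example. It only shows how taking at most N URLs per run and saving the remainder lets the next run pick up where the previous one stopped.

```python
import json
import os
import tempfile


def run_round(state_path, seeds, top_n):
    """Take at most top_n pending URLs, then save state so a later run resumes.

    Hypothetical helper for illustration only; Nutch keeps its own crawl state.
    """
    if os.path.exists(state_path):
        # A previous run left state behind: continue from it.
        with open(state_path) as f:
            state = json.load(f)
    else:
        # First run: start from the seed list.
        state = {"pending": list(seeds), "fetched": []}

    batch = state["pending"][:top_n]
    state["pending"] = state["pending"][top_n:]
    state["fetched"].extend(batch)  # a real crawler would download here

    with open(state_path, "w") as f:
        json.dump(state, f)
    return batch
```

Calling `run_round(path, seeds, top_n=2)` once a day with the same `path` would hand back the next two URLs each time and an empty list once nothing is pending.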

Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com



On Fri, Dec 4, 2009 at 4:10 AM, Mr Hadoop <[hidden email]> wrote:

> I am just starting to learn Nutch. One thing I want to know: can Nutch
> pause, stop, and resume indexing a site on an incremental daily basis?
> My concern is that Nutch will behave like a hog, crawling everything with
> huge bandwidth consumption and pissing off many site owners.
>
> Can some experts shed some light on this?

Re: Can nutch pause, stop and start where it left off?

MilleBii
Nutch behaves ...
So by default it will not fetch more than 1 URL every 5 s (the setting
is changeable) from a given host (identified by name or IP, depending on
the Nutch conf file).
So you will actually find the opposite: it is very slow for a single
site. Speed comes when you fetch several sites in parallel.
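
A rough back-of-the-envelope sketch of why that is, in Python. The function name `estimate_fetch_seconds` is made up for this example, and it assumes the 5 s per-host delay mentioned above, that all hosts are fetched fully in parallel, and that download time itself is negligible. It is an estimate, not how Nutch schedules fetches internally.

```python
from collections import Counter
from urllib.parse import urlparse


def estimate_fetch_seconds(urls, delay=5.0):
    """Rough lower bound on crawl time with one request per host per `delay` s.

    Assumes every host is fetched in parallel and ignores download time.
    """
    per_host = Counter(urlparse(u).hostname for u in urls)
    if not per_host:
        return 0.0
    # The busiest host dictates the total: n requests need (n - 1) waits.
    return (max(per_host.values()) - 1) * delay
```

Under these assumptions, 100 URLs on a single host need at least 99 waits of 5 s (495 s), while 100 URLs spread over 100 different hosts need no waiting at all.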


2009/12/4, Jesse Hires <[hidden email]>:

> use the -topN flag to only grab a small number of URLs.
> Also I believe there is also a setting you can put in nutch-site.xml that
> can be used to slow down how many URLs you grab over time.
>
> Jesse


--
-MilleBii-