Nutch STOP conditions

Nutch STOP conditions

brainstorm-2-2
There's something I still want to ask that I haven't found clearly
explained in the FAQ or on the mailing list:

Nutch STOP conditions, meaning: "how to stop a running nutch crawl"

In other words, how to define crawl:

1) "time limit": Crawl for Q hours and stop
2) "segments limit": After generating N segments, stop
3) "space limit": After M megabytes/space on DFS used, stop.
4) "input urls limit": After crawling Z urls from the original (seed)
input set, stop.
5) "depth limit": After reaching crawling depth X "far away" from
original input url list, stop.

More "limits" doubts/suggestions are welcome ;)

I'll put the answer(s) on the Nutch wiki (FAQ section) if you don't mind;
I think it could clarify this for lots of people on the mailing
list (me included! :-S).
Re: Nutch STOP conditions

brainstorm-2-2
Gonna try to reply to myself :-S... if you have simpler ways to do it,
your advice will be more than welcome.

On Fri, Aug 22, 2008 at 1:34 PM, brainstorm <[hidden email]> wrote:
> There's something I still want to ask that I haven't found clearly
> explained in the FAQ or on the mailing list:
>
> Nutch STOP conditions, meaning: "how to stop a running nutch crawl"
>
> In other words, how to define crawl:
>
> 1) "time limit": Crawl for Q hours and stop


An at-based time limit? nutch/bin/stop-all.sh... kind of drastic, but I
guess it works; I don't know whether the resulting data is left
consistent :-S
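For what it's worth, the at-based idea can be made a bit less drastic by checking the clock between rounds instead of killing the daemons mid-segment. A minimal sketch, where Q, the paths, and the -topN value are illustrative rather than anything from the thread:

```shell
#!/bin/sh
# Stop after roughly Q hours by checking the clock between crawl rounds.
Q=6
deadline=$(( $(date +%s) + Q * 3600 ))

past_deadline() {
  # true once the given epoch time is at or past the deadline
  [ "$1" -ge "$deadline" ]
}

# Illustrative driver loop (commented out so the sketch stands alone):
# while ! past_deadline "$(date +%s)"; do
#   bin/nutch generate crawl/crawldb crawl/segments -topN 1000
#   seg=$(ls -d crawl/segments/* | tail -1)
#   bin/nutch fetch "$seg"
#   bin/nutch updatedb crawl/crawldb "$seg"
# done
```

This way the crawl always stops on a segment boundary, so the data should stay consistent, at the cost of overshooting the deadline by up to one round.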


> 2) "segments limit": After generating N segments, stop


No clue.


> 3) "space limit": After M megabytes/space on DFS used, stop.


Some hooks on ganglia metrics + polling to determine if we're out of space?
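A poll-based sketch of that idea, assuming `hadoop dfs -du` is available for measuring usage; the threshold and paths are made up for illustration:

```shell
#!/bin/sh
# Stop crawling once DFS usage reaches M megabytes.
limit_mb=500

over_limit() {
  # true once used space (in MB) is at or above the limit
  [ "$1" -ge "$limit_mb" ]
}

# In a real poll loop, usage could come from something like:
#   used_mb=$(hadoop dfs -du / | awk '/^[0-9]/ {s+=$1} END {print int(s/1024/1024)}')
# and the loop would run bin/stop-all.sh (or just skip the next
# generate round) as soon as over_limit "$used_mb" fires.
```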


> 4) "input urls limit": After crawling Z urls from the original (seed)
> input set, stop.


-depth 1 ?
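For the record, -depth combines with -topN in the one-shot crawl command, which gives a rough cap on fetched urls; the flag semantics below follow the 0.x `bin/nutch crawl` usage, and the numbers are illustrative:

```shell
#!/bin/sh
# -depth N : run at most N generate/fetch/update rounds from the seed list
#            (-depth 1 fetches only the seed urls themselves)
# -topN Z  : fetch at most Z top-scoring urls per round
# e.g.:  bin/nutch crawl urls -dir crawl -depth 1 -topN 5000
depth=1
topN=5000
# Rough upper bound on urls fetched over the whole crawl:
bound=$(( depth * topN ))
echo "$bound"
```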


> More "limits" doubts/suggestions are welcome ;)
>
> I'll put the answer(s) on Nutch wiki (FAQ section) if you don't mind,
> I think it could clarify this spot to lots of people on the mailing
> list (me included ! :-S).
>
Reply | Threaded
Open this post in threaded view
|

Re: Nutch STOP conditions

brainstorm-2-2
I really need to know how to stop the crawl (cleanly, if possible)
once I have a predefined number of urls in the LinkDB... is there any
way to do that with the nutch cmdline tool, or do I have to extend the
Crawl class and write it on my own?

Thanks in advance,
Roman
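Not a full answer, but one way to sketch it: the fetched url list lives in the CrawlDb, and assuming `bin/nutch readdb <crawldb> -stats` prints a `TOTAL urls:` line (as the 0.x CrawlDbReader does), a driver script can poll it between rounds. Everything below is illustrative, not tested against a live crawl:

```shell
#!/bin/sh
# Run crawl rounds until the CrawlDb holds at least Z urls.
Z=100000

enough_urls() {
  # true once the observed url count reaches the target
  [ "$1" -ge "$Z" ]
}

# Illustrative driver loop (commented out so the sketch stands alone):
# while :; do
#   count=$(bin/nutch readdb crawl/crawldb -stats | awk '/TOTAL urls/ {print $NF}')
#   enough_urls "$count" && break
#   bin/nutch generate crawl/crawldb crawl/segments -topN 1000
#   seg=$(ls -d crawl/segments/* | tail -1)
#   bin/nutch fetch "$seg"
#   bin/nutch updatedb crawl/crawldb "$seg"
# done
```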

On Thu, Oct 2, 2008 at 5:36 PM, brainstorm <[hidden email]> wrote:

> [earlier messages quoted in full; snipped]
Re: Nutch STOP conditions

Lyndon Maydwell
I haven't seen any documentation on this either, unfortunately, but you
can probably kill a crawl without too much drama, as long as you make
sure the rest of your script tidies up any half-finished steps and
then exits.
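A sketch of that tidy-up step, under the assumption that a segment which finished fetching has a crawl_fetch subdirectory (true for the 0.x segment layout); the paths are illustrative:

```shell
#!/bin/sh
# Identify half-finished segments left behind by a killed crawl.
incomplete() {
  # true if the segment directory lacks a crawl_fetch subdir,
  # i.e. the fetch step never completed for it
  [ ! -d "$1/crawl_fetch" ]
}

# Cleanup pass (dry run; swap echo for rm -r once you trust it):
# for seg in crawl/segments/*; do
#   incomplete "$seg" && echo "would remove $seg"
# done
```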

On Wed, Dec 17, 2008 at 7:48 PM, brainstorm <[hidden email]> wrote:

> I really need to know how to stop the crawl (cleanly, if possible)
> once I have a predefined number of urls in the LinkDB... is there any
> way to do that with the nutch cmdline tool, or do I have to extend the
> Crawl class and write it on my own?
>
> Thanks in advance,
> Roman
>
> [earlier messages quoted in full; snipped]