Crawling after a period of time

Crawling after a period of time

k-team
Hi all,

I would like to know how to assign a 'recrawl time' period to sites.

For example, I want to crawl one specific site every day and another one every week.

Another thing: how does Nutch decide when to recrawl a site?

thanks.

ciao,
Marco

Re: Crawling after a period of time

Jack.Tang
Hi Marco

You can use the operating system's built-in scheduler, such as cron
(crontab) on Unix, or a Java library such as Quartz.
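
A scheduler only fires the job; something still has to decide which sites are due on a given run. Below is a minimal sketch of that bookkeeping in Python, assuming a wrapper script that cron starts once a day. The hostnames, `RECRAWL_INTERVALS`, and `sites_due` are hypothetical names made up for illustration and are not part of Nutch.

```python
from datetime import datetime, timedelta

# Hypothetical per-site recrawl intervals; the hostnames are placeholders.
RECRAWL_INTERVALS = {
    "daily.example.com": timedelta(days=1),
    "weekly.example.com": timedelta(weeks=1),
}


def sites_due(last_crawled, now, intervals=RECRAWL_INTERVALS):
    """Return the sites whose recrawl interval has elapsed, in sorted order.

    last_crawled maps a site to the datetime of its last crawl; a site
    that is missing from it has never been crawled and is always due.
    """
    due = []
    for site, interval in intervals.items():
        previous = last_crawled.get(site)
        if previous is None or now - previous >= interval:
            due.append(site)
    return sorted(due)


if __name__ == "__main__":
    now = datetime(2005, 5, 27, 12, 0)
    last = {
        "daily.example.com": now - timedelta(days=2),
        "weekly.example.com": now - timedelta(days=3),
    }
    # Only the daily site has exceeded its interval here.
    print(sites_due(last, now))
```

The wrapper would then start a crawl for each returned site and record the new crawl time; how the crawl itself is launched is left out on purpose.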

/Jack

On 5/27/05, k-team <[hidden email]> wrote:

> hi all,
>
>       I wanted to know how to assign a period of 'recrawl time' to sites.
>
> for example I want to crawl every day a specific site and every week another.
>
> another thing is... how nutch decides to recrawl a site?
>
> thanks.
>
> ciao,
> Marco
>

Re: Crawling after a period of time

k-team
hi Jack,

> You can use operation system built in scheduler such as crontab in
> Unix, or some java lib such as Quartz.

Hmm, maybe I explained myself badly. Yes, I know cron, but I was
wondering how Nutch decides to recrawl, for example, URLs that are
one week old.

thanks.

ciao,
Marco

Re: Crawling after a period of time

Jack.Tang
Hi

In my project, I actually re-crawl the whole website every time, and add a
URL dedup listener to the crawl job. I mean that when Nutch finishes
crawling the website, URL dedup follows.
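
To make the "crawl everything, then dedup" idea concrete, here is a rough sketch in Python. It is not Nutch's own dedup and not the listener described above; `normalize` and `dedup_urls` are hypothetical helpers, and the normalization rules (lowercase scheme and host, drop the fragment, default the path to `/`) are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def dedup_urls(urls):
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Calling `dedup_urls` on the URL list of a finished crawl would collapse entries that differ only in host case or fragment. A real setup would probably also want to compare page content, which this sketch does not attempt.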

Any better ideas?

/Jack

On 5/27/05, k-team <[hidden email]> wrote:

> hi Jack,
>
> > You can use operation system built in scheduler such as crontab in
> > Unix, or some java lib such as Quartz.
>
> mmm maybe I have explained myself badly. yeah, I know cron but I was
> wondering how nutch decides to recrawl -- for example -- urls that are
> one week old.
>
> thanks.
>
> ciao,
> Marco
>

Re: Crawling after a period of time

k-team
hey Jack,

> In my project, I really re-crawl the website everytime, and add one
> url dedup listener to the crawl job. I mean when nutch finishes the
> crawl web site, url dedup follows.

ok, that's fine.

I have another question: Nutch crawls to a different depth every time you
tell it to crawl.
How can you tell when it restarts from the beginning? Maybe I have to
supply the old segments as an argument?

thanks.

ciao,
Marco

Recommended UrlFilters

quovadis
Anyone got any recommended or tried and tested
urlfilters/normalizers for whole-web crawling?

Thanks