A couple of basic questions re scheduled crawls.

Fred Zimmerman-3
I have a couple of very basic questions about scheduled crawls.

Every crawled page has a scheduled fetch date (?).  How do I know that
Nutch actually went out and crawled it?  How does Nutch know to do so -- is
there a cron job?  Do I have to explicitly issue a fetch command, or are
all future fetch commands scheduled the first time I run bin/crawl?

https://www.pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
--
Fred Zimmerman

Re: A couple of basic questions re scheduled crawls.

Sebastian Nagel-2
Hi Fred,

as soon as you generate the fetch list (calling bin/crawl does this)
and the CrawlDb at that time contains items with a (re)fetch date in the past,
you'll get a non-empty fetch list and Nutch will (re)fetch those pages.
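
To make the selection rule concrete, here is a rough Python sketch of the idea. It is not Nutch's actual code: the names CrawlDbEntry, generate_fetch_list and mark_fetched are hypothetical, and the 30-day interval is an assumed value used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CrawlDbEntry:
    """Hypothetical stand-in for one CrawlDb record."""
    url: str
    fetch_time: datetime  # next scheduled (re)fetch date


def generate_fetch_list(crawldb, now):
    """Select the URLs whose scheduled fetch date has been reached."""
    return [entry.url for entry in crawldb if entry.fetch_time <= now]


def mark_fetched(entry, now, interval=timedelta(days=30)):
    """After a fetch, push the next fetch date one interval into the future.

    The 30-day interval is an assumption made for this sketch.
    """
    entry.fetch_time = now + interval


now = datetime(2018, 7, 26, 17, 44)
crawldb = [
    CrawlDbEntry("http://example.com/due", now - timedelta(days=1)),
    CrawlDbEntry("http://example.com/not-due", now + timedelta(days=10)),
]

# Only the entry whose fetch date lies in the past is selected.
due = generate_fetch_list(crawldb, now)
print(due)

# Once fetched, the entry is rescheduled and drops out of the next fetch list.
mark_fetched(crawldb[0], now)
print(generate_fetch_list(crawldb, now))
```

Nothing in this sketch runs on a timer: the fetch date only matters at the moment the fetch list is generated.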

You always have to call bin/crawl explicitly. Of course, you may set up a cronjob
to call it.
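
As a sketch of what such a cron-driven call could look like, the Python wrapper below builds the bin/crawl command line and runs it. The paths, the helper names build_crawl_command and run_crawl, and the argument order (-s seed dir, crawl dir, number of rounds) are assumptions; the arguments bin/crawl accepts differ between Nutch versions, so check the usage message of your own bin/crawl first.

```python
import subprocess


def build_crawl_command(nutch_home, seed_dir, crawl_dir, rounds):
    """Build the argument list for bin/crawl.

    Assumed argument order: -s <seed dir> <crawl dir> <num rounds>.
    Verify this against the bin/crawl script of your Nutch version.
    """
    return [f"{nutch_home}/bin/crawl", "-s", seed_dir, crawl_dir, str(rounds)]


def run_crawl(nutch_home, seed_dir, crawl_dir, rounds):
    """Run one crawl cycle and return the exit code of bin/crawl."""
    command = build_crawl_command(nutch_home, seed_dir, crawl_dir, rounds)
    return subprocess.run(command).returncode


# Hypothetical paths, shown only to illustrate the resulting command line.
print(build_crawl_command("/opt/nutch", "urls", "crawl", 2))
```

A cron entry would then invoke a script that calls run_crawl(...) at whatever interval suits you; pages whose (re)fetch date has passed by then are picked up in that run.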

Best,
Sebastian

On 07/26/2018 05:44 PM, Fred Zimmerman wrote:
> I have a couple of very basic questions about scheduled crawls.
>
> Every crawled page has a scheduled fetch date (?).  How do I know that
> Nutch actually went out and crawled it?  How does Nutch know to do so -- is
> there a cron job?  Do I have to explicitly issue a fetch command, or are
> all future fetch commands scheduled the first time I run bin/crawl?
>
> https://www.pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/
>