need to override refetch intervals


need to override refetch intervals

Michael Coffey
In order to achieve the most timely crawling of news sites, I want to be able to manipulate the refetch intervals and scores in the crawl db. I thought I could accomplish that by re-injecting the urls that should be re-fetched most often. According to the documentation, it seems I should be able to do that using the db.injector.overwrite property. However, it does not actually work for me.


Here is the injection command I use:
$NUTCH_HOME/runtime/deploy/bin/nutch inject -D db.score.injected=10 -D db.injector.overwrite=true -D db.fetch.interval.default=1800 /crawls/news0/data/crawldb /crawls/news0/seeds/reuters.txt

After re-injecting, I inspect the crawldb dump and see that the intervals and scores have not been overwritten. I have also tried db.injector.update=true, with similar results.

I suspect that my db.fetch.interval.default does not affect existing urls. Is there any way to change the refetch intervals of existing urls?



For a test case, one could inject a few of the following urls, crawl several iterations, and then inject all of them. The result should be that all of them have the 1800 interval.

http://mobile.reuters.com/
http://mobile.reuters.com/business
http://mobile.reuters.com/finance
http://mobile.reuters.com/news/entertainment
http://mobile.reuters.com/news/entertainment/arts
http://mobile.reuters.com/news/environment
http://mobile.reuters.com/news/health
http://mobile.reuters.com/news/lifestyle
http://mobile.reuters.com/news/oddlyEnough
http://mobile.reuters.com/news/science
http://mobile.reuters.com/news/sports
http://mobile.reuters.com/news/technology
http://mobile.reuters.com/news/us
http://mobile.reuters.com/news/world
http://mobile.reuters.com/politics
http://www.reuters.com/subjects/healthcare
https://www.reuters.com/
https://www.reuters.com/energy-environment
https://www.reuters.com/finance
https://www.reuters.com/money
https://www.reuters.com/news/entertainment
https://www.reuters.com/news/health
https://www.reuters.com/news/technology
https://www.reuters.com/news/world
https://www.reuters.com/politics

Re: need to override refetch intervals

Michael Coffey
I also tried including metadata in the seeds file (TAB-delimited) as follows.


http://mobile.reuters.com/      nutch.score=100 nutch.fetchInterval=1800
http://mobile.reuters.com/business      nutch.score=100 nutch.fetchInterval=1800


That did not change the existing entries either, so I am still looking for a way to manipulate the refetch intervals and scores in the crawl db.


________________________________
From: Michael Coffey <[hidden email]>
To: User <[hidden email]>
Sent: Friday, November 24, 2017 3:13 PM
Subject: need to override refetch intervals




Re: need to override refetch intervals

Sebastian Nagel
Hi Michael,

> http://mobile.reuters.com/        nutch.score=100 nutch.fetchInterval=1800

works (make sure you have tabs as separators).
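
Since editors and mail clients often turn tabs into spaces, one way to be sure of the separators is to generate the seed file rather than type it. The following is only a sketch: the output path in SEEDS is a placeholder, and the two URLs and metadata values are the ones quoted above.

```shell
#!/bin/sh
# Placeholder output path (an assumption); point it at your own seed directory.
SEEDS="${SEEDS:-./reuters.txt}"

# Start from an empty file, then write one line per URL.
# printf expands \t to a real TAB character.
: > "$SEEDS"
for url in http://mobile.reuters.com/ http://mobile.reuters.com/business; do
  printf '%s\tnutch.score=100\tnutch.fetchInterval=1800\n' "$url" >> "$SEEDS"
done

# Sanity check: every line must split into exactly 3 TAB-separated fields.
awk -F '\t' 'NF != 3 { bad++ } END { exit (bad > 0) }' "$SEEDS" && echo "seed file OK"
```

If the awk check fails on a hand-edited file, some separators are most likely spaces, and the metadata would then be read as part of the URL.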

Of course, if the URLs are already in CrawlDb you need to "overwrite" them.

   nutch inject  ...   -overwrite
      -D db.injector.overwrite=true does not work because it's overwritten by
      -overwrite or is set to false if -overwrite is absent ;(

or "update"

   nutch inject  ...   -update
     (-update will only overwrite the fetch interval if it's not the default,
      otherwise it preserves the fetch interval which might have been changed adaptively)
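
Put together with the paths from the original post, the two variants might look like the sketch below. The run helper and the /opt/nutch fallback for NUTCH_HOME are assumptions added for illustration; by default the script only prints the command, so it can be checked before it touches the CrawlDb. The score and interval are expected to come from the seed file metadata, so no -D options are passed here.

```shell
#!/bin/sh
# Paths are the ones quoted in the thread; the NUTCH_HOME fallback is a placeholder.
NUTCH="${NUTCH_HOME:-/opt/nutch}/runtime/deploy/bin/nutch"
CRAWLDB=/crawls/news0/data/crawldb
SEEDS=/crawls/news0/seeds/reuters.txt

# Hypothetical helper: print the command unless DRY_RUN=0, in which case run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Replace existing CrawlDb entries with the injected score and interval.
run "$NUTCH" inject "$CRAWLDB" "$SEEDS" -overwrite

# Alternative: update existing entries instead of replacing them.
# run "$NUTCH" inject "$CRAWLDB" "$SEEDS" -update
```

Setting DRY_RUN=0 executes the command for real. The flag goes after the two positional arguments, matching the form shown above.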

Best,
Sebastian

On 11/27/2017 09:23 PM, Michael Coffey wrote:

> I also tried including metadata in the seeds file (TAB-delimited) as follows.
>
>
> http://mobile.reuters.com/      nutch.score=100 nutch.fetchInterval=1800
> http://mobile.reuters.com/business      nutch.score=100 nutch.fetchInterval=1800
>
>
> So, I am still looking for a way to manipulate the refetch intervals and scores in the crawl db.