depth scoring filter


depth scoring filter

Michael Coffey
I am trying to develop a news crawler and I want to prohibit it from wandering too far away from the seed list that I provide.
It seems like I should use the DepthScoringFilter, but I am having trouble getting it to work. After a few crawl cycles, all the _depth_ metadata say either 1 or 1000. Scores, meanwhile, vary from 0 to 1 and mostly don't look like depths.
I have added a scoring.depth.max property to nutch-site.xml.
<property>
  <name>scoring.depth.max</name>
  <value>3</value>
</property>

I have changed the plugin.includes list to contain scoring-depth instead of opic, and now it looks like this.

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
</property>
This is all using Nutch 1.12.

What do I need to do to limit the crawl frontier so it won't go more than N hops from the seed list, if that is possible?


Re: depth scoring filter

Jigal van Hemert | alterNET internet BV
Hi,

On 20 September 2017 at 06:36, Michael Coffey <[hidden email]>
wrote:

> I am trying to develop a news crawler and I want to prohibit it from
> wandering too far away from the seed list that I provide.
> It seems like I should use the DepthScoringFilter, but I am having trouble
> getting it to work. After a few crawl cycles, all the _depth_ metadata say
> either 1 or 1000. Scores, meanwhile, vary from 0 to 1 and mostly don't look
> like depths.
> I have added a scoring.depth.max property to nutch-site.xml.
> <property>
>   <name>scoring.depth.max</name>
>   <value>3</value>
> </property>
>
>
I use the same plugin to index only the seeds plus one level below; the value
for that is 2, so your setup crawls the seeds plus two levels below.

I never looked at the values of the _depth_ metadata and, frankly, since
it does what it's supposed to do, I don't care what it stores in
its metadata here.

> What do I need to do to limit the crawl frontier so it won't go more than N
> hops from the seed list, if that is possible?
>
>
As said above, it should be enough to set the value to N+1.
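The arithmetic behind the N+1 rule can be sketched like this (a hypothetical simulation, not Nutch code; the function and the link data are made up for illustration):

```python
# Hypothetical sketch (not actual Nutch code) of how DepthScoringFilter
# accounts for depth: seeds enter the CrawlDb with _depth_=1, every
# outlink hop adds 1, and pages whose depth would exceed
# scoring.depth.max are never generated.

def crawl_frontier(seeds, outlinks, max_depth):
    """Return {url: depth} for all URLs reachable within max_depth (seeds = 1)."""
    depth = {url: 1 for url in seeds}
    frontier = list(seeds)
    while frontier:
        url = frontier.pop()
        if depth[url] >= max_depth:
            continue  # children would exceed scoring.depth.max
        for link in outlinks.get(url, []):
            if link not in depth:
                depth[link] = depth[url] + 1
                frontier.append(link)
    return depth

# With scoring.depth.max=3, the seed plus two levels of outlinks survive:
links = {"seed": ["a"], "a": ["b"], "b": ["c"]}
print(crawl_frontier(["seed"], links, 3))  # {'seed': 1, 'a': 2, 'b': 3}
```

So with a value of 2 only the seeds and their direct outlinks are fetched, matching the "seed plus one level" setup described above.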

--


Kind regards,


Jigal van Hemert | Developer



Langesteijn 124
3342LG Hendrik-Ido-Ambacht

T. +31 (0)78 635 1200
F. +31 (0)848 34 9697
KvK. 23 09 28 65

[hidden email]
www.alternet.nl



Re: depth scoring filter

Michael Coffey
I am still having trouble with the depth scoring filter, and now I have a simpler test case. It does work, somewhat, when I give it a list of 50 seed URLs, but when I give it a very short list, it fails.
I have tried scoring.depth.max values in the range 1-6. None of them work for the short-list cases.

If my seed list contains just http://www.cnn.com/ 
it can do one generate/fetch/update cycle, but then fails saying "0 records selected for fetching" on the next cycle.
The same is true if I give it this short list of URLs:
http://www.thedailybeast.com/
http://www.thedailybeast.com
https://thedailybeast.com/
https://thedailybeast.com

The same is true for this short list of URLs:
https://nytimes.com/
http://www.nytimes.com/
https://www.nytimes.com/

In each case, the first cycle updates a reasonable-looking list of urls into the crawldb, so it seems strange that the depth filter stops it from selecting anything in subsequent rounds.
The cnn seed works fine when I use opic and not scoring-depth.

Here is a partial listing of the readdb dump from the failing CNN trial:
http://www.cnn.com/    Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Sep 22 15:47:46 PDT 2017
Modified time: Thu Sep 21 15:47:46 PDT 2017
Retries since fetch: 0
Retry interval: 86400 seconds (1 days)
Score: 1.0
Signature: d9a6e1aaedca7795ea469dce4929704a
Metadata:
     _depth_=1
    _pst_=success(1), lastModified=0
    _rs_=77
    Content-Type=text/html
    _maxdepth_=3
    nutch.protocol.code=200

http://www.google.com/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Sep 21 15:49:13 PDT 2017
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.03125
Signature: null
Metadata:
     _depth_=1000
    _maxdepth_=3

http://www.googletagservices.com/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Sep 21 15:49:12 PDT 2017
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.03125
Signature: null
Metadata:
     _depth_=1000
    _maxdepth_=3

http://www.i.cdn.cnn.com/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Sep 21 15:49:13 PDT 2017
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.03125
Signature: null
Metadata:
     _depth_=1000
    _maxdepth_=3

http://www.ugdturner.com/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Sep 21 15:49:11 PDT 2017
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.03125
Signature: null
Metadata:
     _depth_=1000
    _maxdepth_=3

http://z.cdn.turner.com/    Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Sep 21 15:49:12 PDT 2017
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.03125
Signature: null
Metadata:
     _depth_=1000
    _maxdepth_=3

https://plus.google.com/+cnn/posts    Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Sep 21 15:49:13 PDT 2017
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 5184000 seconds (60 days)
Score: 0.03125
Signature: null
Metadata:
     _depth_=1000
    _maxdepth_=3






      From: Jigal van Hemert | alterNET internet BV <[hidden email]>
 To: user <[hidden email]>
 Sent: Tuesday, September 19, 2017 11:43 PM
 Subject: Re: depth scoring filter
   

Re: depth scoring filter

Sebastian Nagel
Hi Michael,

I've just tried it with 1.12 and the recent master of 1.x - it works as expected,
except for meta refresh redirects when the fetcher isn't parsing.
Actually, this has been an open issue for a few months. I'll try to address it
in the next days - https://issues.apache.org/jira/browse/NUTCH-2261

A little background on what happens for meta refresh redirects:
 - the _depth_ is copied from the link source to the link target in the segment
 - when CrawlDb is updated with links and fetch status from the segment
 - _depth_=1000 is the fall-back if there is no _depth_ found in the segment's CrawlDatum
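That fall-back can be sketched roughly as follows (hypothetical pseudocode of the CrawlDb update step, not the actual Nutch source; the function names are made up):

```python
# Rough sketch (not actual Nutch source) of how _depth_ can end up as 1000:
# when the CrawlDb is updated from a segment, a link target whose CrawlDatum
# carries no _depth_ value gets the fall-back depth of 1000, which is always
# above scoring.depth.max, so the page is never selected for fetching.

DEPTH_FALLBACK = 1000

def updated_depth(segment_metadata, source_depth=None):
    """Depth assigned to a link target during the CrawlDb update."""
    if "_depth_" in segment_metadata:
        return segment_metadata["_depth_"]  # copied from the link source
    if source_depth is not None:
        return source_depth + 1             # normal one-hop increment
    return DEPTH_FALLBACK                   # no _depth_ found in the segment

def eligible_for_fetch(depth, max_depth):
    return depth <= max_depth

# A page whose _depth_ was lost (e.g. via a meta refresh redirect):
d = updated_depth({})
print(d, eligible_for_fetch(d, 3))  # 1000 False
```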

But there may be some other reason. Starting from http://www.cnn.com/ with 3 cycles I got only
one page with the weird _depth_=1000. Maybe try it slowly, cycle by cycle, and check whether
an item in the CrawlDb goes wrong...

Best,
Sebastian

On 09/22/2017 04:57 AM, Michael Coffey wrote:


Re: depth scoring filter

Michael Coffey
Thanks for that information. I've moved on to using regex-urlfilter instead of trying to filter by depth. It's probably better for what I'm trying to do anyway.
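For reference, a regex-urlfilter.txt that keeps the crawl on the seed hosts might look roughly like this (the host patterns are hypothetical examples; RegexURLFilter applies the first matching rule, where + accepts and - rejects):

```
# Accept only the news hosts we seeded (hypothetical list)
+^https?://(www\.)?cnn\.com/
+^https?://(www\.)?nytimes\.com/
# Reject everything else
-.
```

This restricts the frontier by host rather than by hop count, which suits a news crawl that should stay on known sites.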


      From: Sebastian Nagel <[hidden email]>
 To: [hidden email]
 Sent: Monday, September 25, 2017 9:36 AM
 Subject: Re: depth scoring filter
   