Usage previous stage HostDb data for generate(fetched deltas)

Usage previous stage HostDb data for generate(fetched deltas)

Semyon Semyonov
Dear all,

I plan to extend the HostDb functionality so that a DB_FETCHED delta is available to the generate stage.

Let's say that for each website, generate keeps running while the number of fetched pages is < 150.
The problem is that for some websites this limit will (almost) never be reached, because of the site's structure.

For example:
1) Round 1. 1 page
2) Round 2. 10 pages
3) Round 3. 80 pages
4) Round 4. 1 page
5) Round 5. 1 page
...etc.

I would like to add a delta condition on fetched that describes the speed of the process. Let's say generate runs while number of fetched < 150 && delta_fetched > 1.
In this case the process should therefore stop at round 5, with 92 pages fetched in total (1 + 10 + 80 + 1).
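
A minimal sketch of this stop rule in Python, for illustration only (it is not Nutch code). The function name `simulate_host` and its parameters are hypothetical. It also assumes that the delta is undefined until two HostDb snapshots exist, so the check starts only before round 3; without that assumption, round 1 (1 page) would already fail `delta_fetched > 1`.

```python
def simulate_host(pages_per_round, max_fetched=150, min_delta=1):
    """Return (rounds_run, total_fetched) for one host.

    pages_per_round[i] is the number of pages round i+1 would fetch.
    Assumption: delta_fetched is only defined once two snapshots exist.
    """
    snapshots = []  # fetched totals after each (hypothetical) hostdb update
    fetched = 0
    for pages in pages_per_round:
        if fetched >= max_fetched:
            break  # absolute limit reached
        if len(snapshots) >= 2:
            delta_fetched = snapshots[-1] - snapshots[-2]
            if delta_fetched <= min_delta:
                break  # crawl has slowed down too much
        fetched += pages
        snapshots.append(fetched)
    return len(snapshots), fetched


print(simulate_host([1, 10, 80, 1, 1, 1, 1]))                # (4, 92)
print(simulate_host([1, 10, 80, 1, 1, 1, 1], min_delta=-1))  # (7, 95)
```

With the delta rule, four rounds run and round 5 is never generated; with the rule effectively disabled (`min_delta=-1`), the host keeps being crawled one page at a time.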

To do this, I plan to modify the updatehostdb function and add a delta variable for fetched to HostDatum.

Do you think this is a good way to do it?

Semyon.

Fw: Usage previous stage HostDb data for generate(fetched deltas)

Semyon Semyonov
I have created an issue for this functionality:
https://issues.apache.org/jira/browse/NUTCH-2481
 
 

Sent: Thursday, December 14, 2017 at 2:07 PM
From: "Semyon Semyonov" <[hidden email]>
To: "usernutch.apache.org" <[hidden email]>
Subject: Usage previous stage HostDb data for generate(fetched deltas)

RE: Usage previous stage HostDb data for generate(fetched deltas)

Yossi Tamari
Hi Semyon,

Maybe I'm missing the point, but I don't see why you would want to do this.
On one hand, if there is only 1 URL per cycle, why not fetch it? The cost is negligible.
On the other hand, imagine this scenario: you find the first link to some host from another host, and you crawl it. But it happens to be a "leaf" document that has no links (or maybe only a homepage link), so your delta condition is not satisfied. Later you find another link to this host from another host, this time to the homepage, where you can find all the "good" links, but you will not crawl it, because your delta condition is still not satisfied.
What am I missing?

        Yossi.
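
The scenario above can be sketched as a small Python simulation. This is an illustrative model, not Nutch code: the names `allowed` and `run`, the `min_delta` parameter, and the convention that the delta check only starts once two HostDb snapshots exist are all assumptions.

```python
def allowed(history, min_delta=1):
    """True if the (assumed) delta rule lets generate select this host."""
    if len(history) < 2:
        return True  # assumption: no delta before two snapshots exist
    return history[-1] - history[-2] > min_delta


def run(new_urls_per_round, min_delta=1):
    """new_urls_per_round[i]: URLs of this host discovered before round i+1.

    Returns (fetched totals after each round, URLs still pending).
    """
    history, fetched, pending = [], 0, 0
    for new_urls in new_urls_per_round:
        pending += new_urls
        if allowed(history, min_delta):
            fetched += pending  # everything pending is fetched this round
            pending = 0
        history.append(fetched)
    return history, pending


# Round 1: leaf page found; round 3: homepage link found from another host.
print(run([1, 0, 1, 0, 0]))                # ([1, 1, 1, 1, 1], 1)
print(run([1, 0, 1, 0, 0], min_delta=-1))  # ([1, 1, 2, 2, 2], 0)
```

In this model the homepage stays pending indefinitely under the delta rule, because a host that is never generated can never raise its own delta again.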



Re: RE: Usage previous stage HostDb data for generate(fetched deltas)

Semyon Semyonov
Hi Yossi,

What you say makes sense if you run Nutch in "whole Internet crawling" mode. In other words, you don't specify the set of hosts you want to crawl, but crawl without bound.

Our case is different. We crawl specific hosts for each country (around 200,000). For each host we set up a stop condition in generate, with an expression based on the number of pages fetched per host, let's say db_fetched < 100 (see https://issues.apache.org/jira/browse/NUTCH-2368).

The problem is that for really deep websites this limit can be hard to reach (in practice, it never is). As an illustration, imagine a website with the following structure (pages fetched per round): 1-10-15-5-1-1-1-...

Therefore I want a mechanism to stop crawling such a host at a specific point, even though the db_fetched limit has not been reached yet.

Semyon.

Re: Usage previous stage HostDb data for generate(fetched deltas)

Semyon Semyonov
I have proposed a solution for this (https://issues.apache.org/jira/browse/NUTCH-2481).

With this commit we can use delta statistics from the HostDb (the HostDb before and after the update): the differences are calculated and saved in the metadata.

For example, to use fetched deltas in generate:

1) To calculate FetchedDelta in the hostdb update:

<property>
  <name>hostdb.deltaExpression</name>
  <value>{return new("javafx.util.Pair", "FetchedDelta", currentHostDatum.fetched - previousHostDatum.fetched);}</value>
</property>

2) To use FetchedDelta in generate, so that websites with FetchedDelta < 5 are not crawled:

<property>
  <name>generate.max.count.expr</name>
  <value>if (fetched > 70 &amp;&amp; FetchedDelta &lt; 5) {return new("java.lang.Double", 0);} else {return conf.getDouble("generate.max.count", -1);}</value>
</property>
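
To make the logic of the two expressions easier to follow, here is a plain Python model of what they compute. It is only a sketch: the function names are hypothetical, it is not the JEXL that Nutch evaluates, and the default of -1 simply mirrors the fallback passed to conf.getDouble above.

```python
def fetched_delta(current_fetched, previous_fetched):
    """Model of hostdb.deltaExpression: growth of 'fetched' between updates."""
    return current_fetched - previous_fetched


def max_count(fetched, delta, default_max_count=-1.0):
    """Model of generate.max.count.expr.

    Returns 0.0 (generate nothing for this host) once the host has more than
    70 fetched pages and its last delta dropped below 5; otherwise returns
    the configured default.
    """
    if fetched > 70 and delta < 5:
        return 0.0
    return default_max_count


print(max_count(91, fetched_delta(91, 11)))  # fast-growing host: default
print(max_count(92, fetched_delta(92, 91)))  # slowed down after 70+: 0.0
print(max_count(1, fetched_delta(1, 1)))     # small host, guard not hit: default
```

Note the `fetched > 70` guard: a host that has only a handful of pages so far is not cut off by a small delta, which is the situation the leaf-page objection earlier in the thread describes.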

The commit still needs testing, though, so feel free to test and modify it.
 
