I managed to apply the patch from the issue, but I had to make a small modification to the code (it didn't work with the Nutch REST API; a patch is attached to the issue).
I used the patch with the following settings:

<property>
  <name>generate.max.count.expr</name>
  <value>
    if(fetched > 120) {return new("java.lang.Double", 0);} else {return conf.getDouble("generate.max.count", -1);}
  </value>
</property>

That works, and I am settling on this approach. It adds one more step to the crawling process (updating the hostdb), but that seems like a necessary evil for the time being.

Sent: Monday, October 23, 2017 at 2:57 PM
From: "Markus Jelsma" <[hidden email]>
To: "[hidden email]" <[hidden email]>
Subject: RE: Ways of limit pages per host. generate.max.count, hostdb, scoring-depth

How about NUTCH-2368's variable generate.max.count based on HostDB data?

Regards,
Markus

[1] https://issues.apache.org/jira/browse/NUTCH-2368

-----Original message-----
> From:Semyon Semyonov <[hidden email]>
> Sent: Monday 23rd October 2017 15:51
> To: [hidden email]
> Subject: Ways of limit pages per host. generate.max.count, hostdb, scoring-depth
>
> Hi,
>
> I'm looking for the best way to restrict the number of pages crawled per host. I have a list of hosts to crawl, let's say M hosts, and I would like to limit crawling on each host to MaxPages pages.
> External links are turned off for the crawling processes.
>
> My own proposal can be found at 3).
>
> 1) Using https://www.mail-archive.com/user@.../msg10245.html
> We know the size of the cluster (number of nodes) and the size of the list (M).
> If we divide M by (number of nodes in the cluster * number of fetches per node), we get the total number of rounds for first-level crawling (K).
> Then we multiply this by the number of levels needed for the website (N = 2, 3, 4, ...), depending on how deep we want to go into a specific website.
> Let's say that to crawl the whole list we need K = 500 rounds and we want to crawl each website up to the 4th level (N = 4); the total number of rounds is then K * N = 2000.
> Combined with generate.max.count = MaxPages, we get at most MaxPages * N pages per host.
> Problem: the process has to run smoothly enough to guarantee that the full list is crawled in K rounds. Problems with the crawling process and/or the Hadoop cluster could break this.
>
> 2) The second approach is to use hostdb: https://www.mail-archive.com/user@.../msg14330.html
> Problem: this requires additional computation for hostdb plus a workaround with the blacklist.
>
> 3) My own solution, which is a bit tricky.
> Use the scoring-depth plugin extension and the generate.min.score config.
>
> That plugin sets the weight of each linked page to ParentWeight / (number of linked pages). The initial weight is 1 by default.
>
> My idea is that we can estimate the maximum number of pages for a host.
> To illustrate, there are several ways to reach a weight of 1/4 on a host (5 pages, 5 pages and 7 pages):
>
>            1
>        / /   \ \
>       / /     \ \
>      / /       \ \
>   1/4 1/4     1/4 1/4
>
>         1
>        / \
>       /   \
>      /     \
>    1/2     1/2
>    / \
>  1/4 1/4
>
>          1
>        /   \
>       /     \
>      /       \
>    1/2       1/2
>    / \       / \
>  1/4 1/4   1/4 1/4
>
> The last tree gives the maximum number of pages with a weight of 1/4 (3 levels, each of which sums to 1). Total number of pages = 7.
> The idea is that the deepest tree gives the maximum number of links. The deepest tree can be derived from the prime factors of the denominator of the final weight.
>
> For example, for 1/4 we take the prime factors of 4 = 1 * 2 * 2; the total number of pages is 1 + 1 * 2 + 1 * 2 * 2 = 7.
> For a weight of 1/9: 1 + 1*3 + 1*3*3 = 13
> For a weight of 1/48: 1 + 1*2 + 1*2*2 + 1*2*2*2 + 1*2*2*2*2 + 1*2*2*2*2*3 = 79
>
> The calculator: http://www.calculator.net/factoring-calculator.html?cvar=18&x=77&y=22
> Problem: the score can be affected by other plugins.
>
> Thanks.
>
> Semyon.
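
The page estimate from proposal 3) in the quoted message can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described there, under the assumption that the minimum weight has the form 1/d for an integer d and that the prime factors are applied in ascending order, as in the examples. The function names prime_factors and pages_for_min_weight are made up for illustration and are not part of Nutch.

```python
def prime_factors(n):
    """Return the prime factors of n in ascending order, with multiplicity."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever remains is itself prime
    return factors


def pages_for_min_weight(denominator):
    """Sum the level sizes 1, f1, f1*f2, ... for a minimum weight of 1/denominator.

    Each tree level multiplies the page count by the next prime factor,
    so the last level holds `denominator` pages of weight 1/denominator.
    """
    total = 1  # the root page, weight 1
    level = 1  # number of pages on the current level
    for f in prime_factors(denominator):
        level *= f
        total += level
    return total


if __name__ == "__main__":
    for d in (4, 9, 48):
        print(f"1/{d}: {pages_for_min_weight(d)} pages")
```

For 4 and 9 this gives 7 and 13, matching the examples; for 48 = 2*2*2*2*3 it gives 1 + 2 + 4 + 8 + 16 + 48 = 79. The sum depends on the order in which the factors are applied (taking the 3 first for 48 gives 1 + 3 + 6 + 12 + 24 + 48 = 94), so the figure is best treated as a rough estimate rather than a guaranteed bound.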
