Hi,
I'm looking for the best way to restrict the number of pages crawled per host. I have a list of hosts to crawl, say M hosts, and I would like to limit crawling on each host to MaxPages. External links are turned off for the crawling processes.

My own proposal can be found under 3).

1) Using https://www.mail-archive.com/user@.../msg10245.html
We know the size of the cluster (number of nodes) and the size of the list (M). Dividing M by (number of nodes in the cluster * number of fetches per node) gives the total number of rounds for the first-level crawl (K). We then multiply this by the number of levels per website (N = 2, 3, 4, ...), depending on how deep we want to go into each specific website. Say crawling the whole list takes K = 500 rounds and we want to crawl each website down to the 4th level (N = 4); then the total number of rounds is K*N = 2000. Combined with generate.max.count = MaxPages, this caps each host at MaxPages * N pages.
Problem: the process has to run smoothly enough to guarantee the full list is crawled in K rounds. Potential problems with the crawling process and/or the Hadoop cluster.

2) The second approach is to use the hostdb: https://www.mail-archive.com/user@.../msg14330.html
Problem: it asks for additional computation for the hostdb, plus a workaround with the blacklist.

3) My own solution; it is a bit tricky.
Use the scoring-depth plugin extension together with the generate.min.score config. That plugin sets the weight of each linked page to ParentWeight / NumberOfLinkedPages. The initial weight equals 1 by default.

My idea is that we can estimate the maximum number of pages for a host. To illustrate, there are several ways to reach weights of 1/4 on a host (5 pages, 5 pages, and 7 pages):

          1
      /  / \  \
   1/4 1/4 1/4 1/4

          1
         / \
       1/2  1/2
       / \
     1/4  1/4

          1
         / \
       1/2   1/2
       / \   / \
     1/4 1/4 1/4 1/4

The last tree gives the maximum number of pages with weight 1/4 (3 levels, each summing to 1). Total sum = 7 pages. The idea behind it is that the maximum number of links is produced by the deepest tree, and the deepest tree corresponds to the prime factorization of the final weight's denominator.

For example, for weight 1/4 we take the prime factors of 4 = 2 * 2; the total number of pages equals 1 + 1*2 + 1*2*2 = 7.
For weight 1/9: 1 + 1*3 + 1*3*3 = 13.
For weight 1/48 (48 = 2*2*2*2*3): 1 + 2 + 2*2 + 2*2*2 + 2*2*2*2 + 2*2*2*2*3 = 79.

A factoring calculator: http://www.calculator.net/factoring-calculator.html?cvar=18&x=77&y=22
Problem: the score can be affected by other plugins.

Thanks.

Semyon.
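To make the round arithmetic of approach 1) concrete, here is a minimal sketch. The figures (host count, cluster size, fetches per node) are assumed for illustration and are not Nutch settings, except that max_pages stands in for generate.max.count:

```python
import math

# Assumed figures for illustration only.
hosts = 50_000            # M: hosts in the seed list
nodes = 10                # cluster size
fetches_per_node = 10     # hosts fetched per node per round
depth = 4                 # N: how deep to crawl each website
max_pages = 100           # generate.max.count per host

# K: rounds needed to cover the whole list at one level.
rounds_per_level = math.ceil(hosts / (nodes * fetches_per_node))
# K * N: total rounds to reach the desired depth everywhere.
total_rounds = rounds_per_level * depth
# Upper bound on pages per host: MaxPages per level, N levels.
pages_cap_per_host = max_pages * depth

print(rounds_per_level, total_rounds, pages_cap_per_host)  # 500 2000 400
```

With these assumed figures the sketch reproduces the K = 500, K*N = 2000 example from the message above.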
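The page estimate in approach 3) can also be sketched in code. This follows the message's formula only: for generate.min.score = 1/w, take the prime factors of w in ascending order and sum the prefix products (one tree level per factor). The function name is mine, not part of Nutch:

```python
def prime_factors(n):
    """Prime factors of n in ascending order, e.g. 48 -> [2, 2, 2, 2, 3]."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def max_pages_for_min_score(w):
    """Estimate of pages per host when generate.min.score = 1/w.
    The deepest tree splits by one prime factor per level, so the
    page count is the sum of prefix products of the factors of w."""
    total, level = 1, 1   # the root page has weight 1
    for p in prime_factors(w):
        level *= p        # number of pages on this level
        total += level
    return total

print(max_pages_for_min_score(4))   # 7  = 1 + 2 + 4
print(max_pages_for_min_score(9))   # 13 = 1 + 3 + 9
print(max_pages_for_min_score(48))  # 79 = 1 + 2 + 4 + 8 + 16 + 48
```

The 1/4 and 1/9 cases match the worked examples in the message; note this is only an estimate of the tree shape, since other scoring plugins can shift the actual scores.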
How about NUTCH-2368's variable generate.max.count based on HostDB data?
Regards,
Markus

[1] https://issues.apache.org/jira/browse/NUTCH-2368
Thanks for the suggestion.
Could you explain how I can use it in the crawling process? Should I call generate with a specific parameter? It is not really clear from the issue. I use Nutch 1.13.

Semyon.