[jira] [Comment Edited] (NUTCH-2455) Speed up the merging of HostDb entries for variable fetch delay

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16278820#comment-16278820 ]

Semyon Semyonov edited comment on NUTCH-2455 at 12/5/17 4:31 PM:
-----------------------------------------------------------------

[~wastl-nagel]
I have started working on this issue and am facing some problems with the combination of host and score.

You proposed:
the map function then emits key-value pairs <host, score> -> <url,crawldatum,score,...>
of course, the HostDatums must be wrapped into the value structure. It's already a custom class (SelectorEntry), so that should be doable
via partitioning and secondary sorting these arrive in the reduce function:
all keys with the same host in one call of the function
in the following order: first the HostDatum (just assign an artificially high score), then the CrawlDatum items sorted by decreasing score

In the current code, limit = job.getLong(GENERATOR_TOP_N, Long.MAX_VALUE) / job.getNumReduceTasks(), and the reduce method acts as follows:
       if (count == limit) {
          // do we have any segments left?
          if (currentsegmentnum < maxNumSegments) {
            count = 0;
            currentsegmentnum++;
          } else
            break;
        }

This check runs for each key in the reducer, where the keys are scores in sorted order. Therefore the reducer takes the topN highest-scored URLs across all hosts.
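Outside Hadoop, that counting can be sketched in plain Java (the method name emitted and its parameters are illustrative names of mine, not Nutch's API):

```java
// Standalone sketch (plain Java, no Hadoop) of the limit/segment counting in
// Selector's reduce; 'emitted' and its parameters are illustrative names.
public class SegmentLimitSketch {

    /** How many records a single reducer emits, given its per-reducer limit. */
    static int emitted(int totalRecords, long limit, int maxNumSegments) {
        int count = 0;
        int currentsegmentnum = 1;
        int emitted = 0;
        for (int i = 0; i < totalRecords; i++) {
            if (count == limit) {
                // do we have any segments left?
                if (currentsegmentnum < maxNumSegments) {
                    count = 0;
                    currentsegmentnum++;
                } else {
                    break;
                }
            }
            emitted++;
            count++;
        }
        return emitted;
    }

    public static void main(String[] args) {
        // e.g. topN = 60 over 2 reducers gives limit = 30; with 2 segments one
        // reducer emits at most 60 records.
        System.out.println(emitted(100, 30, 2)); // 60
        System.out.println(emitted(40, 30, 2));  // 40
    }
}
```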

With the proposed approach this no longer works, because the data is now sorted by host (all keys with the same host arrive in one call of the function).

For example, take bbc.com (300 pages) and amazon.com (200 pages), with topN = 70.
Currently it works as follows:
1st call (score 1.0): 20 pages from bbc.com + 10 pages from amazon.com
2nd call (score 0.5): 5 pages from bbc.com + 35 pages from amazon.com

If we introduce "one call per host for the HostDb system", it becomes:
1st call (bbc.com): 70 pages from bbc.com, 0 from amazon.com

I'm thinking about an alternative solution:
1) Use a composite key (score, host). As the value we use SelectorEntry and add the HostDatum there. From the first mapper (HostDb reader) we get only HostDb data, from the second mapper only CrawlDb data.

Therefore, the combined output of the two mappers can look like this:
(1, bbc.com) - (crawl, null)
(1, bbc.com) - (crawl, null)
(0.5, bbc.com) - (crawl, null)
(null, bbc.com) - (null, hostdb)

The host is the partitioning key (or domain/IP, as it works now).
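A minimal sketch of such a composite key and host-only partitioning (plain Java with illustrative names; the real key would be a Hadoop Writable and the partitioner a Partitioner subclass):

```java
// Illustrative composite key (score, host); not the real Nutch/Hadoop classes.
// A null score marks the HostDb record for that host.
public class ScoreHostKey {
    final Float score;   // null for the HostDb entry
    final String host;

    ScoreHostKey(Float score, String host) {
        this.score = score;
        this.host = host;
    }

    /** Partition on host only, so every key of a host reaches the same reducer. */
    static int partition(ScoreHostKey key, int numReduceTasks) {
        return (key.host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the score is ignored during partitioning, the HostDb record and all CrawlDb records of a host land in the same reduce task.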

2) Implement a SortComparatorClass so that keys with score == null compare before all others; therefore all keys with score == null go to the top.
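That ordering can be sketched as a plain Java comparator (assuming null marks the HostDb entry; the real SortComparatorClass would compare serialized keys):

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of the proposed sort order: a null score (the HostDb marker) compares
// before everything else; real scores sort in decreasing order.
public class NullFirstScoreSort {

    static final Comparator<Float> NULL_FIRST_DESC = (a, b) -> {
        if (a == null) return (b == null) ? 0 : -1; // HostDb entry goes first
        if (b == null) return 1;
        return Float.compare(b, a);                 // then decreasing score
    };

    public static void main(String[] args) {
        Float[] scores = {1.0f, 0.5f, null, 1.0f};
        Arrays.sort(scores, NULL_FIRST_DESC);
        System.out.println(Arrays.toString(scores)); // [null, 1.0, 1.0, 0.5]
    }
}
```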

3) (Optionally) use a grouping comparator to combine all keys with score == null into one group.

After these steps, at the top we should have the HostDb data for all keys reaching the reducer, so we first read it and load it into memory. Afterwards we just follow the natural score order and check the HostDb restriction.
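The reduce-side idea can be sketched like this (plain Java; hostLimit stands in for whatever restriction the HostDatum carries, and all names here are assumptions of mine, not the patch's API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of one reduce call under the proposed ordering: the HostDatum has
// already been read from the front of the stream (reduced here to a per-host
// limit), and the remaining values arrive in decreasing-score order.
public class HostAwareReduce {

    static List<String> reduce(int hostLimit, List<String> urlsByDecreasingScore,
                               long reducerLimit) {
        List<String> selected = new ArrayList<>();
        for (String url : urlsByDecreasingScore) {
            if (selected.size() == reducerLimit) break; // global topN limit
            if (selected.size() == hostLimit) break;    // HostDb restriction
            selected.add(url);
        }
        return selected;
    }
}
```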

What do you think about this way?


> Speed up the merging of HostDb entries for variable fetch delay
> ---------------------------------------------------------------
>
>                 Key: NUTCH-2455
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2455
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>         Attachments: NUTCH-2455.patch
>
>
> Citing Sebastian at NUTCH-2420:
> ??The correct solution would be to use <host,score> pairs as keys in the Selector job, with a partitioner and secondary sorting so that all keys with same host end up in the same call of the reducer. If values can also hold a HostDb entry and the sort comparator guarantees that the HostDb entry (entries if partitioned by domain or IP) comes in front of all CrawlDb entries. But that would be a substantial improvement...??



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)