[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay


JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2368:
---------------------------------
    Attachment: NUTCH-2368.patch

Updated patch. The delay is now also set on minCrawlDelay so that it works when more than one thread works on the queue. The key is also temporarily set on every CrawlDatum but removed when it is passed to the fetch queue.

> Variable generate.max.count and fetcher.server.delay
> ----------------------------------------------------
>
>                 Key: NUTCH-2368
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2368
>             Project: Nutch
>          Issue Type: Improvement
>          Components: generator
>    Affects Versions: 1.12
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.13
>
>         Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch
>
>
> In some cases we need to use host-specific characteristics to determine crawl speed and bulk sizes, because with our (Openindex) settings we can just recrawl hosts with up to 800k URLs.
> This patch solves the problem by introducing the HostDB to the Generator and providing powerful Jexl expressions. Check these two expressions added to the Generator:
> {code}
> -Dgenerate.max.count.expr='
> if (unfetched + fetched > 800000) {
>   return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
> } else {
>   return conf.getDouble("generate.max.count", 300);
> }'
> -Dgenerate.fetch.delay.expr='
> if (unfetched + fetched > 800000) {
>   return (pct95._rs_ + 500);
> } else {
>   return conf.getDouble("fetcher.server.delay", 1000)
> }'
> {code}
> For each large host: select as many records as can be fetched within the fetch time limit, based on the number of threads and the 95th percentile response time. Or: queueMaxCount = (timelimit / responsetime) * numThreads.
> The second expression follows up on that, setting the crawlDelay of the fetch queue.
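
To make the arithmetic of the two expressions concrete, here is a minimal Python sketch that mirrors them. It is not part of the patch: the function and parameter names are hypothetical, the 95th percentile response time is assumed to be in milliseconds (as the `+ 500` and `/ 1000` terms suggest), and plain floating-point division is used, which may differ from how Jexl evaluates the same expression.

```python
# Illustrative re-statement of the two Jexl expressions from the issue.
# All names below are hypothetical; only the formulas come from the issue text.

LARGE_HOST_THRESHOLD = 800_000  # unfetched + fetched above this counts as a "large host"


def max_count(unfetched, fetched, pct95_rs_ms,
              timelimit_mins=12, threads_per_queue=1, default_max_count=300):
    """Mirror of generate.max.count.expr: how many URLs to select for a host."""
    if unfetched + fetched > LARGE_HOST_THRESHOLD:
        # Seconds available divided by seconds per fetch (p95 response time
        # plus 500 ms), multiplied by the number of threads on the queue.
        seconds_available = timelimit_mins * 60
        seconds_per_fetch = (pct95_rs_ms + 500) / 1000
        return seconds_available / seconds_per_fetch * threads_per_queue
    return default_max_count


def fetch_delay_ms(unfetched, fetched, pct95_rs_ms, default_delay_ms=1000):
    """Mirror of generate.fetch.delay.expr: crawl delay for the fetch queue."""
    if unfetched + fetched > LARGE_HOST_THRESHOLD:
        return pct95_rs_ms + 500
    return default_delay_ms


if __name__ == "__main__":
    # A large host with an assumed p95 response time of 1500 ms.
    print(max_count(900_000, 0, 1500))       # 720 s / 2.0 s per fetch * 1 thread
    print(fetch_delay_ms(900_000, 0, 1500))  # 1500 ms + 500 ms
```

With the defaults used in the expressions (a 12 minute time limit and one thread per queue), a large host with a 1500 ms p95 response time would get 360 URLs and a 2000 ms delay, while a small host falls back to the configured `generate.max.count` and `fetcher.server.delay` values.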



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)