[jira] [Commented] (NUTCH-2776) Fetcher to temporarily deduplicate followed redirects

Chris Mattmann (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063585#comment-17063585 ]

ASF GitHub Bot commented on NUTCH-2776:
---------------------------------------

sebastian-nagel commented on pull request #505: NUTCH-2776 Fetcher to temporarily deduplicate followed redirects
URL: https://github.com/apache/nutch/pull/505
 
 
   - cache followed redirect targets for a configurable time (`fetcher.redirect.dedupcache.seconds`)
   - if a redirect target is already in the cache, the redirect is skipped
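
The two points above could be sketched roughly as follows. This is only an illustration of a time-limited deduplication cache, not the code in the pull request: the class `RedirectDedupCache`, its method `shouldFollow`, and the injectable clock are hypothetical names, and the time-to-live is assumed to come from `fetcher.redirect.dedupcache.seconds`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not a Nutch class): remembers redirect targets
 * for a limited time so that repeated redirects to the same URL can be skipped.
 */
class RedirectDedupCache {
  private final long ttlMillis;
  private final LongSupplier clock; // returns the current time in milliseconds
  private final Map<String, Long> expiryByUrl = new ConcurrentHashMap<>();

  /** ttlSeconds is assumed to be read from fetcher.redirect.dedupcache.seconds. */
  RedirectDedupCache(long ttlSeconds, LongSupplier clock) {
    this.ttlMillis = ttlSeconds * 1000L;
    this.clock = clock;
  }

  /**
   * Returns true if the redirect target should be followed, false if the
   * same target was already followed within the configured time window.
   */
  boolean shouldFollow(String targetUrl) {
    if (ttlMillis <= 0) {
      return true; // assumption: a non-positive value disables deduplication
    }
    long now = clock.getAsLong();
    boolean[] follow = {false};
    // compute() runs atomically per key, so concurrent fetcher threads
    // cannot both claim the same redirect target.
    expiryByUrl.compute(targetUrl, (url, expiry) -> {
      if (expiry == null || expiry <= now) {
        follow[0] = true;        // first sighting, or the old entry expired
        return now + ttlMillis;  // remember the target until this time
      }
      return expiry;             // still cached: keep the entry, skip the redirect
    });
    return follow[0];
  }
}
```

A caller would construct it as `new RedirectDedupCache(ttlSeconds, System::currentTimeMillis)` and check `shouldFollow(target)` before queuing a redirect. The sketch never evicts expired entries, so a real implementation would also need a size bound or periodic cleanup.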
   
 


> Fetcher to temporarily deduplicate followed redirects
> -----------------------------------------------------
>
>                 Key: NUTCH-2776
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2776
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>
> If the fetcher follows redirects (http.redirect.max > 0), many redirects of a site may point to the same URL. In this situation, it would help if the fetcher could temporarily (for a configurable time period) deduplicate the redirect targets and skip all redirects except the first one. Typical examples of duplicated redirect targets are:
> - a page the site redirects to instead of responding with HTTP status 404:
> {noformat}
> /
> /resource-not-found
> /search/
> /404
> /error/not-found
> /err/notfound.html
> {noformat}
> - a page to accept/decline cookies
> {noformat}
> /cookie_usage.php
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)