[jira] [Created] (NUTCH-2776) Fetcher to temporarily deduplicate followed redirects


Chris Mattmann (Jira)
Sebastian Nagel created NUTCH-2776:
--------------------------------------

             Summary: Fetcher to temporarily deduplicate followed redirects
                 Key: NUTCH-2776
                 URL: https://issues.apache.org/jira/browse/NUTCH-2776
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
    Affects Versions: 1.16
            Reporter: Sebastian Nagel
             Fix For: 1.17


If the fetcher follows redirects (http.redirect.max > 0), many redirects on a site may point to the same URL. In this situation, it would be useful if the fetcher could temporarily (for a configurable time period) deduplicate the redirect targets and skip all redirects to the same target except the first one. Typical examples of duplicated redirect targets are:
- a redirect to an error or fallback page instead of a response with HTTP status 404:
{noformat}
/
/resource-not-found
/search/
/404
/error/not-found
/err/notfound.html
{noformat}
- a page to accept or decline cookies:
{noformat}
/cookie_usage.php
{noformat}
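
The proposed behaviour could be sketched roughly as follows. This is only an illustration, not Nutch code: the class name RedirectTargetDeduplicator, its methods, and the idea of passing the current time explicitly are all assumptions made for the example. It remembers when a redirect target was last followed and reports later redirects to the same target as duplicates until the configured window has elapsed.

```java
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not part of Nutch): remembers redirect targets for a
 * limited time so that only the first redirect to a target is followed.
 */
final class RedirectTargetDeduplicator {

    // Length of the deduplication window in milliseconds (assumed to be configurable).
    private final long windowMillis;

    // Redirect target URL -> time (ms) at which a redirect to it was last followed.
    private final ConcurrentHashMap<String, Long> lastFollowed = new ConcurrentHashMap<>();

    RedirectTargetDeduplicator(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /**
     * Returns true if the redirect to targetUrl should be followed, i.e. the
     * target has not been followed within the window. The check and the
     * update happen atomically per URL, so concurrent fetcher threads do not
     * both follow the same target.
     */
    boolean shouldFollow(String targetUrl, long nowMillis) {
        boolean[] follow = {false};
        lastFollowed.compute(targetUrl, (url, seenAt) -> {
            if (seenAt == null || nowMillis - seenAt >= windowMillis) {
                follow[0] = true;
                return nowMillis;   // start a new window for this target
            }
            return seenAt;          // still inside the window: keep the old timestamp
        });
        return follow[0];
    }

    /** Drops entries whose window has expired, to bound memory use. */
    void purgeExpired(long nowMillis) {
        lastFollowed.values().removeIf(seenAt -> nowMillis - seenAt >= windowMillis);
    }

    /** Number of redirect targets currently remembered. */
    int size() {
        return lastFollowed.size();
    }
}
```

In this sketch a fetcher thread would call shouldFollow(target, System.currentTimeMillis()) before queuing a redirect target and skip the target when the result is false. Taking the time as a parameter is only a choice that keeps the example easy to test.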





--
This message was sent by Atlassian Jira
(v8.3.4#803005)