[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects



JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347735#comment-16347735 ]

Markus Jelsma commented on NUTCH-2466:
--------------------------------------

Hello Moreno,

Well, we could obviously allow a setting of -1 and treat it as forever, but forever is infinite, and an endless redirect chain would hang the Nutch task until Hadoop treats it as timed out, usually within ten minutes.

The setting is an int, so if you want, you can set it to the maximum positive integer and handle just over two billion consecutive redirects.

I believe that would justify the meaning of forever in this context. Do you agree?

As a side note, having dealt with the crudeness of the WWW for many years, I consider any sequence of more than four redirects to be, at the root, a whole other problem. Our (company, not ASF Nutch) maximum setting is always three; anything higher has, so far, always led to circular redirects.
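
To illustrate the point about bounded redirects, here is a minimal sketch in Java of following a redirect chain with an int limit. It is not the code from the attached patch: the class RedirectFollower, the method resolve, and the use of an in-memory map in place of real HTTP fetches are all assumptions made for the example. The loop check is an extra safeguard added for illustration; the limit alone already guarantees termination.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not Nutch's sitemap processor.
class RedirectFollower {

    /**
     * Follows redirects starting at startUrl.
     *
     * redirects maps a URL to its redirect target (a stand-in for HTTP 3xx
     * responses); a URL that is not a key in the map is treated as final.
     *
     * Returns the final URL, or null if more than maxRedirects redirects
     * would be needed or if the chain loops back on itself.
     */
    static String resolve(String startUrl, Map<String, String> redirects, int maxRedirects) {
        Set<String> seen = new HashSet<>();
        String current = startUrl;
        int followed = 0;

        while (redirects.containsKey(current)) {
            if (followed >= maxRedirects) {
                return null; // limit reached, give up
            }
            if (!seen.add(current)) {
                return null; // URL already visited: circular redirect
            }
            current = redirects.get(current);
            followed++;
        }
        return current;
    }
}
```

With a chain such as http://example.com/sitemap.xml to https://example.com/sitemap.xml to https://example.com/sitemap_index.xml, a limit of three resolves to the last URL, while a limit of one gives up after the http > https hop. Passing Integer.MAX_VALUE (2147483647) as the limit corresponds to the "just over two billion" case mentioned above.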


> Sitemap processor to follow redirects
> -------------------------------------
>
>                 Key: NUTCH-2466
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2466
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.13
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.15
>
>         Attachments: NUTCH-2466.patch, NUTCH-2466.patch, NUTCH-2466.patch
>
>
> It does follow http > https, but not the following redirect, e.g. sitemap_index.xml that some websites have.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)