[jira] [Commented] (NUTCH-2748) Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970155#comment-16970155 ]

Sebastian Nagel commented on NUTCH-2748:
----------------------------------------

Hi [~markus17], I'm already working on a patch. Agreed, the current behavior is definitely wrong. I'll simplify my first solution. The second would only allow treating exceeded redirects as links or skipping them. Skipping may make sense when http.redirect.max is already set to a higher number (3 or more), or in a large crawl where you cannot trust the sites.

> Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2748
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2748
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>         Attachments: test-NUTCH-2748.zip
>
>
> If the fetcher is following redirects and the maximum number of redirects in a redirect chain (http.redirect.max) is reached, the fetcher stores a CrawlDatum item with status "fetch_gone" and protocol status "redir_exceeded". During the next CrawlDb update, the "gone" item will set the status of existing items (including "db_fetched") to "db_gone". It shouldn't, as there has been no fetch of the final redirect target and nothing is known about its status. A wrong db_gone may then cause a page to be deleted from the search index.
> There are two possible solutions:
> 1. ignore protocol status "redir_exceeded" during CrawlDb update
> 2. when http.redirect.max is hit, the fetcher stores nothing, or a redirect status instead of fetch_gone
> Solution 2 seems easier to implement, and it would make it possible to make the behavior configurable:
> - store the redirect target as an outlink, i.e. the same behavior as if http.redirect.max == 0
> - store "fetch_gone" (current behavior)
> - store nothing, i.e. ignore those redirects - this should be the default, as it's close to the current behavior without the risk of accidentally setting successful fetches to db_gone
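
The three configurable behaviors proposed for solution 2 could be sketched roughly as below. This is a hypothetical illustration, not actual Nutch code: the class, enum, and configuration values are made-up names for the purpose of the example.

```java
// Hypothetical sketch of solution 2: when the fetcher hits
// http.redirect.max, decide what to record based on a configuration
// value. Names here (RedirectExceededPolicy, Mode, the config strings)
// are illustrative and do not exist in Nutch.
class RedirectExceededPolicy {

  enum Mode {
    OUTLINK, // store the redirect target as outlink (as if http.redirect.max == 0)
    GONE,    // store fetch_gone (current behavior)
    IGNORE   // store nothing (proposed default)
  }

  private final Mode mode;

  RedirectExceededPolicy(String configValue) {
    switch (configValue) {
      case "outlink":
        mode = Mode.OUTLINK;
        break;
      case "gone":
        mode = Mode.GONE;
        break;
      default:
        // Default to ignoring exceeded redirects: closest to current
        // behavior without risking db_gone on successfully fetched pages.
        mode = Mode.IGNORE;
        break;
    }
  }

  Mode getMode() {
    return mode;
  }
}
```

Under this sketch, the fetcher would consult the policy once per exceeded redirect chain and either emit an outlink, a fetch_gone CrawlDatum, or nothing at all.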



--
This message was sent by Atlassian Jira
(v8.3.4#803005)