[jira] [Commented] (NUTCH-2748) Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb

David Eric Pugh (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985987#comment-16985987 ]

ASF GitHub Bot commented on NUTCH-2748:
---------------------------------------

sebastian-nagel commented on pull request #485: NUTCH-2748 Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb
URL: https://github.com/apache/nutch/pull/485
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


> Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2748
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2748
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>         Attachments: test-NUTCH-2748.zip
>
>
> If the fetcher is following redirects and the maximum number of redirects in a redirect chain (http.redirect.max) is reached, the fetcher stores a CrawlDatum item with status "fetch_gone" and protocol status "redir_exceeded". During the next CrawlDb update, the "gone" item overwrites the status of existing items (including "db_fetched") with "db_gone". It shouldn't, as the final redirect target has not been fetched and nothing is known about its status. A wrong db_gone may then cause a page to be deleted from the search index.
> There are two possible solutions:
> 1. ignore protocol status "redir_exceeded" during CrawlDb update
> 2. when http.redirect.max is hit the fetcher stores nothing or a redirect status instead of a fetch_gone
> Solution 2 seems easier to implement, and it would be possible to make the behavior configurable:
> - store the redirect target as outlink, i.e. same behavior as if http.redirect.max == 0
> - store "fetch_gone" (current behavior)
> - store nothing, i.e. ignore those redirects - this should be the default, as it is close to the current behavior without the risk of accidentally setting successful fetches to db_gone
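
The three configurable behaviors listed above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the names Policy, Status, onRedirectLimit and updateDb are hypothetical and are not Nutch's CrawlDatum constants, Fetcher code or CrawlDb reducer. The model assumes that a "fetch_gone" item overwrites an existing CrawlDb entry, while a link-only item or no item at all leaves it unchanged.

```java
import java.util.Optional;

public class RedirectExceededSketch {

  /** Hypothetical names for the three options discussed in the issue. */
  enum Policy { STORE_AS_OUTLINK, STORE_GONE, IGNORE }

  /** Simplified stand-ins for statuses (assumption: not Nutch's real constants). */
  enum Status { DB_FETCHED, DB_GONE, FETCH_GONE, LINKED }

  /** What the toy fetcher writes for the URL when the redirect limit is hit. */
  static Optional<Status> onRedirectLimit(Policy policy) {
    switch (policy) {
      case STORE_AS_OUTLINK:
        return Optional.of(Status.LINKED);     // target recorded as a link only
      case STORE_GONE:
        return Optional.of(Status.FETCH_GONE); // behavior described as current
      default:
        return Optional.empty();               // store nothing
    }
  }

  /** Toy CrawlDb update: only fetch_gone overwrites the existing status. */
  static Status updateDb(Status existing, Optional<Status> fetched) {
    if (fetched.isPresent() && fetched.get() == Status.FETCH_GONE) {
      return Status.DB_GONE;
    }
    return existing; // a link or no item leaves the existing entry untouched
  }

  public static void main(String[] args) {
    for (Policy p : Policy.values()) {
      Status result = updateDb(Status.DB_FETCHED, onRedirectLimit(p));
      System.out.println(p + " -> " + result);
    }
  }
}
```

In this model only STORE_GONE turns an existing DB_FETCHED entry into DB_GONE; the other two policies leave it as DB_FETCHED, which is the outcome the issue asks for by default.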



--
This message was sent by Atlassian Jira
(v8.3.4#803005)