[jira] [Commented] (NUTCH-2748) Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

[jira] [Commented] (NUTCH-2748) Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16970107#comment-16970107 ]

Sebastian Nagel commented on NUTCH-2748:

Attached sample CrawlDb and segment to reproduce the problem: if the CrawlDb is updated using the given segment, the "db_fetched" item ([https://en.wikipedia.org/wiki/URL_redirection] )will change its state to db_gone because its referenced at the end of a redirect chain:
 -> https://www.wikipedia.com/wiki/URL_redirection
    -> https://www.wikipedia.org/wiki/URL_redirection
       -> https://en.wikipedia.org/wiki/URL_redirection

> Fetch status gone (redirect exceeded) not to overwrite existing items in CrawlDb
> --------------------------------------------------------------------------------
>                 Key: NUTCH-2748
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2748
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, fetcher
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>         Attachments: test-NUTCH-2748.zip
> If fetcher is following redirects and the max. number of redirects in a redirect chain (http.max.redirect) is reached, fetcher stores a CrawlDatum item with status "fetch_gone" and protocol status "redir_exceeded". During the next CrawlDb update the "gone" item will set the status of existing items (including "db_fetched") with "db_gone". It shouldn't as there has been no fetch of the final redirect target and indeed nothing is know about it's status. An wrong db_gone may then cause that a page gets deleted from the search index.
> There are two possible solutions:
> 1. ignore protocol status "redir_exceeded" during CrawlDb update
> 2. when http.redirect.max is hit the fetcher stores nothing or a redirect status instead of a fetch_gone
> Solution 2. seems easier to implement and it would be possible to make the behavior configurable:
> - store the redirect target as outlink, i.e. same behavior as if http.redirect.max == 0
> - store "fetch_gone" (current behavior)
> - store nothing, i.e. ignore those redirects - this should be the default as it's close to the current behavior without the risk to accidentally set successful fetches to db_gone

This message was sent by Atlassian Jira