[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466284 ]

Doug Cook commented on NUTCH-353:

I have a local fix for this problem (partly Paul Gauthier's work, partly mine) that I have been testing for some time. It's a little bit of a hack, but it's much better than just indexing the redirect target (which is the wrong behavior in many instances; see comments earlier).

The fix is to index both the redirect source and the redirect target, while making sure that the outlinks from the target page are assigned only to the target page. This way, in the (frequent) case that the redirect *source* is the canonical version of the page, with more anchor text, it will show up for searches. The fix seems to work pretty well, and it solves a significant percentage of Nutch's "missing home pages" problem without using much extra space in the index. If it sounds useful to anyone, I'm happy to contribute it back.
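
As a rough illustration of the idea above (not the actual patch, which is not attached to this message), the bookkeeping could look something like the sketch below. Everything in it is hypothetical: `RedirectIndexSketch`, `IndexEntry`, and `entriesFor` are made-up names, not Nutch classes or APIs, and the sketch assumes that the source and target entries share the same fetched content.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only; none of these types exist in Nutch.
class RedirectIndexSketch {

    /** One entry destined for the index: a URL, its content, and its outlinks. */
    static final class IndexEntry {
        final String url;
        final String content;
        final List<String> outlinks;

        IndexEntry(String url, String content, List<String> outlinks) {
            this.url = url;
            this.content = content;
            this.outlinks = outlinks;
        }
    }

    /**
     * Builds the entries for a fetch of sourceUrl that ended up at targetUrl.
     * Both URLs get an entry (assumed here to share the same content), but
     * only the target entry carries the outlinks, so they are assigned once.
     */
    static List<IndexEntry> entriesFor(String sourceUrl, String targetUrl,
                                       String content, List<String> outlinks) {
        List<IndexEntry> entries = new ArrayList<>();
        entries.add(new IndexEntry(targetUrl, content, outlinks));
        if (!sourceUrl.equals(targetUrl)) {
            // Redirect source: indexed so it can match searches, but without outlinks.
            entries.add(new IndexEntry(sourceUrl, content,
                    Collections.<String>emptyList()));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<IndexEntry> entries = entriesFor(
                "http://example.com/", "http://example.com/index.html",
                "Welcome", Arrays.asList("http://example.com/about"));
        for (IndexEntry e : entries) {
            System.out.println(e.url + " outlinks=" + e.outlinks.size());
        }
    }
}
```

In this sketch a page that was not redirected (source equals target) still produces a single entry, so the extra index space is only spent on pages that actually forward.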


> pages that serverside forwards will be refetched every time
> -----------------------------------------------------------
>                 Key: NUTCH-353
>                 URL: https://issues.apache.org/jira/browse/NUTCH-353
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>         Assigned To: Andrzej Bialecki
>            Priority: Blocker
>             Fix For: 0.9.0
>         Attachments: doNotRefecthForwarderPagesV1.patch
> Pages that do a server-side forward are not written back into the crawlDb with a status change. Also, the nextFetchTime is not changed.
> This causes the same page to be refetched again and again. As a result, Nutch is not polite: it refetches the forwarding page and the target page in each segment iteration. It also affects the scoring, since the forwarding page contributes its score to all outlinks.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira