[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466260 ]

Ken Krugler commented on NUTCH-353:

Another small note about this (see NUTCH-411 for a related but different problem) ...

If a page (e.g. http://boutell.com) returns a meta refresh tag (e.g. <meta http-equiv="refresh" content="0;url=http://www.boutell.com/">), and you also wind up fetching the target page independently, then it looks like you can wind up with both pages in the crawl results. One entry has a title like "boutell.com", while the other has the real page title. Or at least I've seen this a few times in our crawl results.
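
To make the duplicate-entry case concrete, here is a rough sketch of how the two entries could be collapsed after the fact. This is not Nutch code: `meta_refresh_target` and `collapse_redirect_sources` are hypothetical helpers, the regular expression only handles the simple attribute order shown above, and the `pages` mapping (URL to HTML) is an assumed stand-in for real crawl results.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern, not taken from Nutch: matches a meta refresh tag whose
# attributes appear in the order http-equiv, then content.
META_REFRESH = re.compile(
    r'<meta\s+http-equiv=["\']?refresh["\']?\s+content=["\']?\s*(\d+)\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE,
)

def meta_refresh_target(base_url, html):
    """Return the absolute refresh target, or None if there is no meta refresh."""
    match = META_REFRESH.search(html)
    if match is None:
        return None
    return urljoin(base_url, match.group(2))

def collapse_redirect_sources(pages):
    """pages maps URL -> HTML. Drop any page whose refresh target is also in pages."""
    kept = {}
    for url, html in pages.items():
        target = meta_refresh_target(url, html)
        if target is not None and target != url and target in pages:
            continue  # the target was fetched independently; keep only that entry
        kept[url] = html
    return kept

pages = {
    "http://boutell.com": '<meta http-equiv="refresh" content="0;url=http://www.boutell.com/">',
    "http://www.boutell.com/": "<title>Real page title</title>",
}
print(sorted(collapse_redirect_sources(pages)))
```

With the two example pages, only the target URL should survive, which is the single-entry result one would expect to see in the index.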

> pages that serverside forwards will be refetched every time
> -----------------------------------------------------------
>                 Key: NUTCH-353
>                 URL: https://issues.apache.org/jira/browse/NUTCH-353
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>         Assigned To: Andrzej Bialecki
>            Priority: Blocker
>             Fix For: 0.9.0
>         Attachments: doNotRefecthForwarderPagesV1.patch
> Pages that do a server-side forward are not written back into the crawlDb with a status change, and their nextFetchTime is not updated.
> This causes the same page to be refetched again and again. As a result, Nutch is not polite: it refetches both the forwarding page and the target page in each segment iteration. It also affects scoring, since the forwarding page contributes its score to all outlinks.
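
The behaviour described in the issue can be illustrated with a toy model. This is only a sketch under stated assumptions, not the attached patch or the real crawlDb API: `CrawlEntry`, `due_for_fetch`, `record_redirect`, the status strings, and the 30-day `FETCH_INTERVAL` are all invented for illustration.

```python
from dataclasses import dataclass

FETCH_INTERVAL = 30 * 24 * 3600  # assumed refetch interval, in seconds

@dataclass
class CrawlEntry:
    """Hypothetical stand-in for one crawl database record."""
    url: str
    status: str = "unfetched"
    next_fetch_time: int = 0

def due_for_fetch(db, now):
    """URLs whose next fetch time has passed, i.e. what the next segment would contain."""
    return [entry.url for entry in db.values() if entry.next_fetch_time <= now]

def record_redirect(db, url, target, now):
    """Write the redirect result back so the forwarding page is not selected again."""
    entry = db[url]
    entry.status = "redirected"
    entry.next_fetch_time = now + FETCH_INTERVAL
    # Make sure the target is known so it can be fetched on its own schedule.
    db.setdefault(target, CrawlEntry(target))

db = {"http://boutell.com": CrawlEntry("http://boutell.com")}
print(due_for_fetch(db, now=100))   # the forwarding page is due
record_redirect(db, "http://boutell.com", "http://www.boutell.com/", now=100)
print(due_for_fetch(db, now=101))   # only the target remains due
```

If `record_redirect` is skipped, the forwarding page keeps its old status and next fetch time, so `due_for_fetch` returns it on every iteration, which matches the repeated refetch reported here.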

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira