[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466261 ]

Ken Krugler commented on NUTCH-353:
-----------------------------------

Wait, looks like maybe change 490607 (fix for NUTCH-273) might fix the issue I just described in my previous comment. I don't think our latest public crawl was done with this patch.

> pages that serverside forwards will be refetched every time
> -----------------------------------------------------------
>
>                 Key: NUTCH-353
>                 URL: https://issues.apache.org/jira/browse/NUTCH-353
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>         Assigned To: Andrzej Bialecki
>            Priority: Blocker
>             Fix For: 0.9.0
>
>         Attachments: doNotRefecthForwarderPagesV1.patch
>
>
> Pages that do a serverside forward are not written with a status change back into the crawlDb. Also the nextFetchTime is not changed.
> This causes a refetch of the same page again and again. The result is nutch is not polite and refetching the forwarding and target page in each segment iteration. Also it effects the scoring since the forward page contribute it's score to all outlinks.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira