[jira] [Updated] (NUTCH-2230) Nutch doesn't index all URLs found

Cristian Vat (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2230:
    Fix Version/s: 2.5

> Nutch doesn't index all URLs found
> ----------------------------------
>                 Key: NUTCH-2230
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2230
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 2.3.1
>         Environment: MongoDB with WiredTiger storage engine (3.2 but probably affects other versions as well)
>            Reporter: Aaron Cosand
>            Priority: Major
>             Fix For: 2.5
>
> The initial query that the generator task runs against MongoDB does not force ordering by _id, which causes incorrect range selection for the successive map-reduce queries. The successive queries do appear to run in the correct order, since _id is always indexed, but they should also specify a sort explicitly, because no particular order is guaranteed otherwise. I didn't dig deep enough to see whether the root of the problem is in Nutch or Gora, or whether it affects only MongoDB or other databases as well.
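
To illustrate the failure mode described above, here is a minimal, self-contained Java sketch. It is not Nutch or Gora code: the class RangeSplitSketch and its methods boundaries and covered are hypothetical, and the way range boundaries are sampled from the initial scan is an assumption made only for illustration. It shows that when boundaries are taken from a scan that was not sorted by key, some of the resulting ranges can be empty or inverted, so some keys are never visited, while sorting the scan first covers every key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RangeSplitSketch {

    // Hypothetical split step: take every `chunk`-th id from the scan
    // result as a range boundary, in whatever order the scan returned them.
    static List<String> boundaries(List<String> scanned, int chunk) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < scanned.size(); i += chunk) {
            out.add(scanned.get(i));
        }
        return out;
    }

    // Count the ids that fall into some range [b[i], b[i+1]), where the
    // last boundary opens an unbounded range [b[last], +infinity).
    static int covered(List<String> ids, List<String> bounds) {
        int count = 0;
        for (String id : ids) {
            boolean hit = false;
            for (int i = 0; i < bounds.size() && !hit; i++) {
                boolean aboveLo = id.compareTo(bounds.get(i)) >= 0;
                boolean belowHi = (i + 1 == bounds.size())
                        || id.compareTo(bounds.get(i + 1)) < 0;
                hit = aboveLo && belowHi;
            }
            if (hit) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed scan order when no sort is requested (illustrative only).
        List<String> unsorted = Arrays.asList("c", "a", "e", "b", "d", "f");

        // The same ids after an explicit sort by key.
        List<String> sorted = new ArrayList<>(unsorted);
        Collections.sort(sorted);

        // Unsorted boundaries are c, e, d: the range [e, d) is empty and
        // nothing covers a or b.
        System.out.println(covered(unsorted, boundaries(unsorted, 2))); // prints 4

        // Sorted boundaries are a, c, e: every id lands in exactly one range.
        System.out.println(covered(sorted, boundaries(sorted, 2)));     // prints 6
    }
}
```

In this sketch the only difference between the two runs is the explicit sort before the boundaries are chosen, which corresponds to the fix suggested in the report (sorting by _id in the initial query).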

This message was sent by Atlassian Jira