[jira] [Resolved] (NUTCH-2184) Enable IndexingJob to function with no crawldb

Tim Allison (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2184.
    Resolution: Fixed

Merged PR #486 into master. Thanks, [~lewismc], for the initial work! I'll close the obsolete PR #95.

> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>                 Key: NUTCH-2184
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2184
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.17
>         Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
> Sometimes when working with distributed team(s), we have found that we can 'lose' data structures which are currently considered critical, e.g. crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no accompanying crawldb or linkdb.
> Absence of the latter is OK, as linkdb is optional; however, crawldb is currently mandatory in [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java].
> This ticket should enhance the IndexerMapReduce code to support the use case where you ONLY have segments and want to force an index for every record present.
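
For illustration only, the requested behaviour can be sketched in plain Java as below. This is not the code from PR #486 or from IndexerMapReduce: the class SegmentOnlyIndexingSketch, the UrlRecord type, and the crawlDbRequired switch are all hypothetical names, and the sketch only assumes that a record needs parsed segment data to be indexable and that the crawldb entry becomes optional.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names come from the Nutch code base.
class SegmentOnlyIndexingSketch {

    /** Stand-in for the per-URL data gathered for one record. */
    static final class UrlRecord {
        final String url;
        final String crawlDbStatus; // null when no crawldb entry exists (assumption)
        final String parsedText;    // null when the segment holds no parsed content

        UrlRecord(String url, String crawlDbStatus, String parsedText) {
            this.url = url;
            this.crawlDbStatus = crawlDbStatus;
            this.parsedText = parsedText;
        }
    }

    /**
     * Returns the URLs that would be sent to the index.
     * With crawlDbRequired == true, records lacking a crawldb entry are skipped.
     * With crawlDbRequired == false, every record with parsed segment data is kept.
     */
    static List<String> selectForIndexing(List<UrlRecord> records, boolean crawlDbRequired) {
        List<String> indexed = new ArrayList<>();
        for (UrlRecord r : records) {
            if (r.parsedText == null) {
                continue; // nothing to index without parsed content
            }
            if (crawlDbRequired && r.crawlDbStatus == null) {
                continue; // the mandatory-crawldb behaviour drops this record
            }
            indexed.add(r.url);
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<UrlRecord> records = Arrays.asList(
            new UrlRecord("http://a.example/", "db_fetched", "text a"),
            new UrlRecord("http://b.example/", null, "text b"),
            new UrlRecord("http://c.example/", "db_fetched", null));

        System.out.println(selectForIndexing(records, true));  // [http://a.example/]
        System.out.println(selectForIndexing(records, false)); // [http://a.example/, http://b.example/]
    }
}
```

With the switch off, the record for http://b.example/ is indexed even though it has no crawldb entry, which is the segments-only use case described above.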
