[jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011746#comment-17011746 ]

ASF GitHub Bot commented on NUTCH-2184:

sebastian-nagel commented on issue #95: NUTCH-2184 Enable IndexingJob to function with no crawldb
URL: https://github.com/apache/nutch/pull/95#issuecomment-572532933
   Closed in favor of #486
   - indexing without a CrawlDb record has already been implemented in NUTCH-2456/#240
   - various improvements from this PR have been integrated in #486
   - separation of mapper and reducer classes is part of NUTCH-2375/#221

> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>                 Key: NUTCH-2184
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2184
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.17
>         Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
> Sometimes when working with distributed team(s), we have found that we can 'lose' data structures which are currently considered critical, e.g. crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no accompanying crawldb or linkdb.
> Absence of the latter is OK, as linkdb is optional; however, crawldb is currently mandatory in [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java].
> This ticket should enhance the IndexerMapReduce code to support the use case where you ONLY have segments and want to force an index for every record present.