[jira] [Created] (NUTCH-2408) CrawlDb: allow update from unparsed segments

JIRA jira@apache.org
Sebastian Nagel created NUTCH-2408:

             Summary: CrawlDb: allow update from unparsed segments
                 Key: NUTCH-2408
                 URL: https://issues.apache.org/jira/browse/NUTCH-2408
             Project: Nutch
          Issue Type: Improvement
          Components: crawldb
    Affects Versions: 1.13
            Reporter: Sebastian Nagel
            Priority: Minor
             Fix For: 1.14

The updatedb command (class o.a.n.crawl.CrawlDb) does not allow updating the CrawlDb with the fetch status alone (from the segment subdirectory crawl_fetch): it also requires reading crawl_parse (which contains outlinks as well as scores, signatures, and metadata).

A workflow that does not require parsing documents (e.g., because raw HTML content is exported to WARC files) is therefore unable to update the CrawlDb with the fetch status.
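
The requested behavior could be sketched roughly as follows. This is only an illustration of the input selection, not Nutch code: the helper name updatedb_inputs is hypothetical, and only the subdirectory names crawl_fetch and crawl_parse come from the description above. The sketch always reads crawl_fetch and adds crawl_parse only when the segment has actually been parsed.

```python
from pathlib import Path
from typing import List

# Segment subdirectory names as given in the issue description.
FETCH_DIR = "crawl_fetch"
PARSE_DIR = "crawl_parse"


def updatedb_inputs(segment: Path) -> List[Path]:
    """Hypothetical helper: pick the segment subdirectories to read.

    crawl_fetch is mandatory because it holds the fetch status.
    crawl_parse is optional, so an unparsed segment still yields
    a usable input list instead of an error.
    """
    fetch = segment / FETCH_DIR
    if not fetch.is_dir():
        raise FileNotFoundError(f"segment has no {FETCH_DIR}: {segment}")

    inputs = [fetch]
    parse = segment / PARSE_DIR
    if parse.is_dir():
        # Parsed segment: also read outlinks, scores, signatures, metadata.
        inputs.append(parse)
    return inputs
```

With this selection, a segment that was fetched but never parsed contributes only its fetch status, which is the case the WARC export workflow needs.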

This message was sent by Atlassian JIRA