[jira] [Assigned] (NUTCH-2496) Speed up link inversion step in crawling script


JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Moreno Feltscher reassigned NUTCH-2496:

    Assignee: Lewis John McGibbney

> Speed up link inversion step in crawling script
> -----------------------------------------------
>                 Key: NUTCH-2496
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2496
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Moreno Feltscher
>            Assignee: Lewis John McGibbney
> While working on a project where I have to index a huge number of URLs, I ran into a performance problem with the link inversion step of the crawling script. A while ago, Ian Lopata stumbled upon the same issue, as described here: http://lucene.472066.n3.nabble.com/InvertLinks-Performance-Nutch-1-6-td4183004.html
> {quote}
> I am running the invertlinks step in my Nutch 1.6 based crawl process on a
> single node.  I run invertlinks only because I need the Inlinks in the
> indexer step so as to store them with the document.  I do not need the
> anchor text and I am not scoring.  I am finding that invertlinks (and more
> specifically the merge of the linkdb) takes a long time - about 30 minutes
> for a crawl of around 150K documents.  I am looking for ways that I might
> shorten this processing time.  Any suggestions?
> {quote}
> Back then, [~wastl-nagel] suggested turning off the normalizers and filters during the inversion step, which speeds up the process considerably.
> In my case, however, I depend on those, so this is not a real solution.
> I opened this issue to get some feedback on how we could improve the crawl script and speed up this step.
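
For reference, the workaround mentioned above (running link inversion with normalizers and filters turned off) might look roughly like the sketch below. It assumes the `-noNormalize` and `-noFilter` options listed in the Nutch 1.x `invertlinks` usage message. The `crawl/linkdb` and `crawl/segments` paths, the `NUTCH_BIN` and `DRY_RUN` variables, and the function name are placeholders invented for illustration, not part of the stock crawl script. By default the sketch only prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only: link inversion without URL normalization and filtering.
# NUTCH_BIN, DRY_RUN and the paths below are hypothetical placeholders.

NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"
# Dry run unless DRY_RUN is explicitly set to the empty string.
DRY_RUN="${DRY_RUN-1}"

# $1 = linkdb path, $2 = directory containing the segments
invertlinks_unfiltered() {
  # In dry-run mode the command line is echoed rather than executed.
  ${DRY_RUN:+echo} "$NUTCH_BIN" invertlinks "$1" -dir "$2" -noNormalize -noFilter
}

invertlinks_unfiltered crawl/linkdb crawl/segments
```

Setting `DRY_RUN=` (empty) would execute the command for real. As the reporter notes, this only helps when the crawl does not depend on normalizers and filters during inversion.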

This message was sent by Atlassian JIRA