[jira] [Commented] (NUTCH-2496) Speed up link inversion step in crawling script

Tim Allison (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108478#comment-17108478 ]

ASF GitHub Bot commented on NUTCH-2496:

sebastian-nagel opened a new pull request #527:
URL: https://github.com/apache/nutch/pull/527

   - disable URL filtering and normalizing when calling invertlinks in bin/crawl
   - add a note that the steps invertlinks, dedup, and index could also be run once outside the loop, over all segments created in the loop iterations
   - move webgraph construction (commented out anyway) outside the loop because it's done over all available segments
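
For illustration only, the first two points could look roughly like the sketch below. It is not the code from the pull request. The crawl directory, the segment name, and the two helper functions are hypothetical, and the commands are printed rather than executed so the sketch runs without a Nutch installation. The -noNormalize, -noFilter and -dir options are the ones listed in the usage message of the invertlinks (LinkDb) tool.

```shell
#!/usr/bin/env bash
# Sketch only: CRAWL_PATH and SEGMENT are hypothetical placeholders,
# not values taken from bin/crawl or from the pull request.
CRAWL_PATH="${CRAWL_PATH:-crawl}"
SEGMENT="${SEGMENT:-20200515000000}"

# Per-segment call inside the crawl loop, with URL normalizing and
# filtering switched off for the link inversion step.
invertlinks_per_segment() {
  local crawl_path="$1" segment="$2"
  echo "bin/nutch invertlinks ${crawl_path}/linkdb ${crawl_path}/segments/${segment} -noNormalize -noFilter"
}

# Alternative: a single call after the loop, over every segment found
# under the segments directory (-dir).
invertlinks_all_segments() {
  local crawl_path="$1"
  echo "bin/nutch invertlinks ${crawl_path}/linkdb -dir ${crawl_path}/segments -noNormalize -noFilter"
}

# Print both command lines instead of running them.
invertlinks_per_segment "$CRAWL_PATH" "$SEGMENT"
invertlinks_all_segments "$CRAWL_PATH"
```

Skipping the filters and normalizers here is only safe if the URLs in the segments have already passed through them in earlier steps; a crawl that relies on filtering during inversion, as the reporter's does, would have to leave them enabled.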

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

> Speed up link inversion step in crawling script
> -----------------------------------------------
>                 Key: NUTCH-2496
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2496
>             Project: Nutch
>          Issue Type: Improvement
>          Components: linkdb
>    Affects Versions: 1.15
>            Reporter: Moreno Feltscher
>            Assignee: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.17
>
> While working on a project where I have to index a huge number of URLs, I encountered an issue with the link inversion step of the crawling script. A while ago, Ian Lopata ran into the same issue, as described here: http://lucene.472066.n3.nabble.com/InvertLinks-Performance-Nutch-1-6-td4183004.html
> {quote}
> I am running the invertlinks step in my Nutch 1.6 based crawl process on a
> single node.  I run invertlinks only because I need the Inlinks in the
> indexer step so as to store them with the document.  I do not need the
> anchor text and I am not scoring.  I am finding that invertlinks (and more
> specifically the merge of the linkdb) takes a long time - about 30 minutes
> for a crawl of around 150K documents.  I am looking for ways that I might
> shorten this processing time.  Any suggestions?
> {quote}
> Back then, [~wastl-nagel] suggested turning off the normalizers and filters during the inversion step, which speeds up the process considerably.
> In my case, however, I depend on those, so this is not a real solution.
> I opened this issue to get feedback on how we could improve the crawl script and speed up this step.

This message was sent by Atlassian Jira