[jira] [Commented] (NUTCH-1932) Automatically remove orphaned pages

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124577#comment-16124577 ]

ASF GitHub Bot commented on NUTCH-1932:
---------------------------------------

sebastian-nagel opened a new pull request #211: NUTCH-1932 Automatically remove orphaned pages
URL: https://github.com/apache/nutch/pull/211
 
 
   - apply Markus Jelsma's latest patch, 2016-06-30
   - add method orphanedScore(Text, CrawlDatum) to ScoringFilter interface
   - complete unit tests for CrawlDb update
 


> Automatically remove orphaned pages
> -----------------------------------
>
>                 Key: NUTCH-1932
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1932
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>         Attachments: NUTCH-1932-add.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch, NUTCH-1932.patch
>
>
> Orphan scoring filter that determines whether a page has become orphaned, i.e. no other pages link to it anymore. If a page has not been linked to for markGoneAfter seconds, it is marked as gone and is then removed by an indexer. If a page has not been linked to for markOrphanAfter seconds, it is removed from the CrawlDb.
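
The two thresholds in the description can be illustrated with a small, self-contained sketch. It is not the code from the patch or the pull request: the class `OrphanSketch`, the `Status` enum, the `classify` method, and the idea of passing in the time of the last seen inlink are all hypothetical names chosen for illustration. Only `markGoneAfter` and `markOrphanAfter` come from the issue text, and the sketch assumes markGoneAfter is smaller than markOrphanAfter.

```java
// Illustrative sketch only; not taken from the NUTCH-1932 patch.
class OrphanSketch {

    // Hypothetical outcome of the check for a single page.
    enum Status { KEEP, GONE, ORPHAN }

    /**
     * Classifies a page by how long ago it was last linked to.
     * All times are in seconds. Assumes markGoneAfterSec < markOrphanAfterSec.
     */
    static Status classify(long nowSec, long lastInlinkSec,
                           long markGoneAfterSec, long markOrphanAfterSec) {
        long sinceLastInlink = nowSec - lastInlinkSec;
        if (sinceLastInlink > markOrphanAfterSec) {
            // No inlink for longer than markOrphanAfter: drop from the CrawlDb.
            return Status.ORPHAN;
        }
        if (sinceLastInlink > markGoneAfterSec) {
            // No inlink for longer than markGoneAfter: mark as gone so an
            // indexer can remove the page from the index.
            return Status.GONE;
        }
        // Linked to recently enough: leave the page untouched.
        return Status.KEEP;
    }

    public static void main(String[] args) {
        long now = 1_000L;
        long markGoneAfter = 200L;   // example value, not a Nutch default
        long markOrphanAfter = 500L; // example value, not a Nutch default
        System.out.println(classify(now, 900L, markGoneAfter, markOrphanAfter));
        System.out.println(classify(now, 700L, markGoneAfter, markOrphanAfter));
        System.out.println(classify(now, 400L, markGoneAfter, markOrphanAfter));
    }
}
```

The orphan threshold is checked first so that a page past both limits is removed outright rather than merely marked as gone.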



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)