[jira] Created: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer

[jira] Created: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer

Markus Jelsma (Jira)
Avoid cloning CrawlDatum in CrawlDbReducer
------------------------------------------

                 Key: NUTCH-761
                 URL: https://issues.apache.org/jira/browse/NUTCH-761
             Project: Nutch
          Issue Type: Improvement
            Reporter: Julien Nioche
            Priority: Minor


In the vast majority of cases, the CrawlDbReducer receives a single CrawlDatum per key in its reduce phase; these are the entries that come from the crawlDB and are not present in the segments.
The attached patch optimizes the reduce step by avoiding an unnecessary cloning of the CrawlDatum fields when there is only one CrawlDatum in the values. The impact grows as the crawlDB gets larger; we noticed an improvement of around 25-30% in the time spent in the reduce phase.
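
The idea can be illustrated with a minimal, self-contained sketch. It is not the attached patch: the Datum class, its set() method, the copies counter and the reduce() signature are hypothetical stand-ins chosen for illustration, and the "keep the most recent entry" rule is only a placeholder for the real merge logic. The sketch assumes, as is usual for Hadoop reducers, that the framework may reuse the value object between iterations, so a value has to be copied only if another value follows it.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CloneAvoidanceSketch {

    /** Hypothetical stand-in for CrawlDatum: a small mutable record. */
    static final class Datum {
        static int copies = 0; // counts field-by-field copies, for illustration only
        int status;
        long fetchTime;

        Datum(int status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }

        /** Copies every field from other into this object (the "cloning" step). */
        void set(Datum other) {
            this.status = other.status;
            this.fetchTime = other.fetchTime;
            copies++;
        }
    }

    /**
     * Returns the datum with the latest fetch time, or null if there are no values.
     * When the key has exactly one value, the value is returned as-is and no copy is made.
     */
    static Datum reduce(Iterator<Datum> values) {
        Datum best = null;
        boolean multiple = false;
        while (values.hasNext()) {
            Datum datum = values.next();
            // As soon as one value has a successor, the key has multiple values
            // and anything we keep must be copied, since the object may be reused.
            if (!multiple && values.hasNext()) {
                multiple = true;
            }
            if (!multiple) {
                best = datum;              // only value for this key: keep the reference
            } else if (best == null) {
                best = new Datum(0, 0L);
                best.set(datum);           // first of several values: copy it
            } else if (datum.fetchTime > best.fetchTime) {
                best.set(datum);           // newer entry: overwrite our private copy
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Datum> single = Arrays.asList(new Datum(1, 100L));
        Datum kept = reduce(single.iterator());
        System.out.println("same instance kept: " + (kept == single.get(0)));
        System.out.println("copies made: " + Datum.copies);
    }
}
```

The check with values.hasNext() is made once per key, so the common single-value case pays for one extra boolean test instead of a full copy of the fields.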

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer

Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-761:
--------------------------------

    Attachment: optiCrawlReducer.patch

[jira] Closed: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer

Markus Jelsma (Jira)
In reply to this post by Markus Jelsma (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki closed NUTCH-761.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.1
         Assignee: Andrzej Bialecki

[jira] Commented: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer

Markus Jelsma (Jira)
In reply to this post by Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782537#action_12782537 ]

Andrzej Bialecki commented on NUTCH-761:
-----------------------------------------

I applied the patch with some changes: I inverted the logic in the name of the boolean variable, and applied the same method to the other cases of non-multiple values. Committed in rev. 884224 - thanks!

[jira] Commented: (NUTCH-761) Avoid cloning CrawlDatum in CrawlDbReducer

Markus Jelsma (Jira)
In reply to this post by Markus Jelsma (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783239#action_12783239 ]

Hudson commented on NUTCH-761:
------------------------------

Integrated in Nutch-trunk #995 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/995/])
    Fix a bug resulting from over-eager optimization in .
 Avoid cloning CrawlDatum in CrawlDbReducer.

