[jira] Created: (NUTCH-530) Add a combiner to improve performance on updatedb

Sebastian Nagel (Jira)
Add a combiner to improve performance on updatedb
-------------------------------------------------

                 Key: NUTCH-530
                 URL: https://issues.apache.org/jira/browse/NUTCH-530
             Project: Nutch
          Issue Type: Improvement
         Environment: java 1.6
            Reporter: Emmanuel Joke
            Assignee: Emmanuel Joke
             Fix For: 1.0.0


When we update the crawldb from a fetched segment, the map task outputs a lot of similar links with status "linked".

We can use a combiner to improve the performance.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-530) Add a combiner to improve performance on updatedb

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Emmanuel Joke updated NUTCH-530:
--------------------------------

    Attachment: NUTCH-530.patch

Patch provided.

It reduced the processing time by 20%.

Output from the task:
Map output records=98317
Map input bytes=10907058
Map output bytes=10021579
Combine input records=98317
Combine output records=42390
Reduce input groups=28601
Reduce input records=43005
Reduce output records=28601

I can see a real improvement.
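
For a rough sense of where the saving comes from, the counters above can be turned into ratios. The snippet below is only arithmetic on the numbers reported in this message; the variable names are made up for illustration and nothing here comes from the patch.

```python
# Counters copied from the task output above.
map_output_records = 98317
combine_output_records = 42390
reduce_input_groups = 28601

# Fraction of map output records the combiner removed before the shuffle.
removed = 1 - combine_output_records / map_output_records

# Average number of records per distinct key (URL), before and after combining.
per_key_before = map_output_records / reduce_input_groups
per_key_after = combine_output_records / reduce_input_groups

print(f"records removed by combiner: {removed:.1%}")
print(f"records per key: {per_key_before:.2f} -> {per_key_after:.2f}")
```

Fewer records per key means less data to sort and shuffle, which is presumably where the reported time saving comes from.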

[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516357 ]

Doğacan Güney commented on NUTCH-530:
-------------------------------------

Ehm, I am not sure about this... After this, we call updateDbScore twice, right? Once to 'merge' the linked datums together, and once to pass the big merged linked datum to the old datum. This changes the ScoringFilter semantics and may not work for a ScoringFilter that, say, uses the number of outlinks as a factor in scoring.
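
To make the concern concrete, here is a minimal Python sketch (not Nutch code) of a hypothetical scoring rule that depends on how many "linked" entries it sees. The function names `merge_linked` and `count_aware_score` are invented for illustration; the real ScoringFilter API is Java and looks different.

```python
def merge_linked(scores):
    """Stand-in for a pre-merge step: collapse many linked scores into one."""
    return [sum(scores)]

def count_aware_score(old_score, linked_scores):
    """Hypothetical rule: add the *average* linked score, so it needs the count."""
    if not linked_scores:
        return old_score
    return old_score + sum(linked_scores) / len(linked_scores)

linked = [0.2, 0.4, 0.6]

direct = count_aware_score(1.0, linked)                    # sees 3 entries
premerged = count_aware_score(1.0, merge_linked(linked))   # sees 1 entry

print(direct, premerged)
```

A purely additive rule would give the same answer both ways, so whether pre-merging matters depends on the filter in use.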

[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516602 ]

Emmanuel Joke commented on NUTCH-530:
-------------------------------------

I'm not sure I follow your point regarding the number of outlinks.

I don't think it's relevant to take the number of inlinks into account. A URL can have inlinks from different segments. If we really wanted to do it, we would have to update the db using all segments in one update. So far, the updateDb is done on only a single segment.


[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516621 ]

Doğacan Güney commented on NUTCH-530:
-------------------------------------

Yeah, you are right.

+1 from me.

[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516673 ]

Andrzej Bialecki  commented on NUTCH-530:
-----------------------------------------

-1 from me.

See the recent discussion on Hadoop-dev - combiners simply re-use the same API as Reducer, but they follow different semantics. The contract for a Combiner is that it could be run several times on the same data, so it should not have side effects on the data beyond mere aggregation of values. In our case, since we would re-use CrawlDbReducer as a combiner, we would do much more than a simple aggregation. Dogacan is right that ScoringFilters would be run twice, which may produce strange results. In addition, not all values may be present when a Combiner is run - combiners are run in the context of the current spill, which may not include all matching values even from the same input file.

Additionally, updatedb can be run with multiple segments - see the synopsis in CrawlDb.run().
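
The contract described above can be illustrated with a small simulation. This is a sketch in plain Python rather than Hadoop code: `run_job`, `sum_combiner`, and `damping_combiner` are hypothetical names, and the 0.85 factor is an arbitrary stand-in for any side effect beyond aggregation.

```python
def sum_combiner(values):
    """Pure aggregation: safe to apply any number of times."""
    return [sum(values)]

def damping_combiner(values):
    """Aggregation plus a side effect (scaling): not safe to repeat."""
    return [0.85 * sum(values)]

def run_job(values, combiner, spill_size, passes):
    """Apply `combiner` to each spill, `passes` times, then do a final sum."""
    for _ in range(passes):
        spills = [values[i:i + spill_size] for i in range(0, len(values), spill_size)]
        values = [out for spill in spills for out in combiner(spill)]
    return sum(values)

scores = [1.0, 2.0, 3.0, 4.0]

# The framework may run the combiner zero, one, or several times.
for passes in (0, 1, 2):
    print(passes,
          run_job(scores, sum_combiner, spill_size=2, passes=passes),
          run_job(scores, damping_combiner, spill_size=2, passes=passes))
```

The summing combiner gives the same total for every number of passes, while the result of the second one drifts with each extra pass.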

[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516675 ]

Emmanuel Joke commented on NUTCH-530:
-------------------------------------

Actually I don't re-use CrawlDbReducer; I've defined a new class as the Combiner. This class only aggregates the scores of all CrawlDatum with the status "Linked" into one CrawlDatum. It's just a part of what CrawlDbReducer does. I've run a few tests in different cases and it has no impact on the current score.
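
The patch itself is not reproduced in this thread, so the following is only a sketch of the idea as described: collapse all "linked" entries for a URL into one by summing their scores, and pass everything else through untouched. It is written in Python for brevity; `Datum`, `STATUS_LINKED`, and `combine_linked` are illustrative names, not the classes from NUTCH-530.patch.

```python
from dataclasses import dataclass

STATUS_LINKED = "linked"  # illustrative constant, not the Nutch status code

@dataclass
class Datum:
    status: str
    score: float

def combine_linked(datums):
    """Sum the scores of 'linked' datums into one; keep other datums as they are."""
    others = [d for d in datums if d.status != STATUS_LINKED]
    linked = [d for d in datums if d.status == STATUS_LINKED]
    if linked:
        others.append(Datum(STATUS_LINKED, sum(d.score for d in linked)))
    return others

values = [Datum("fetched", 1.0), Datum("linked", 0.25), Datum("linked", 0.5)]
once = combine_linked(values)
twice = combine_linked(once)  # applying it again to its own output

print(once)
print(twice)
```

In this sketch the function only sums, so applying it repeatedly leaves the result unchanged. Whether that matches the reducer's behaviour depends on what the scoring filters in use do with the merged values.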

[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525418 ]

Doğacan Güney commented on NUTCH-530:
-------------------------------------

Andrzej, what do you think about this one in light of Emmanuel's last comment? I am still uneasy about ScoringFilters running twice, but I think Emmanuel is right that the semantics don't change.

[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525475 ]

Andrzej Bialecki  commented on NUTCH-530:
-----------------------------------------

I'm still against this patch, exactly because we are not sure how many times the ScoringFilters will be executed - it may be once, twice or N times. The current contract for ScoringFilters is that they are executed once.

CrawlDbReducer itself does not reduce all inlinked datums to a single CrawlDatum - it's up to the scoring filters to do whatever they want to do with all inlinks - although it's true that scoring-opic performs an operation equivalent to this, this may not always be the case.

Second, let's consider the following scenario (BTW, this is close to one of the ScoringFilters that I actually implemented, so it's not far-fetched): let's say I implemented a ScoringFilter that checks for the existence of a flag in CrawlDatum (presumably put there by Generator), and based on the value of this flag it counts the score from inlinks differently. Then it clears the flag to mark a successful update. If we ran an updatedb that includes your patch, this operation would work OK in the first spill from the Combiner (although with vastly incomplete information), and then it would fail to do the right thing on subsequent runs through the Combiner or Reducer, because the flag would already be reset.
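
The scenario above can be mimicked in a few lines. This is a toy model in Python, not the ScoringFilter described in the comment: the `pending` flag, the doubling rule, and the function name `update_score` are all invented for illustration.

```python
def update_score(datum, inlink_scores):
    """Toy filter: if the flag is set, count inlinks double; then clear the flag."""
    weight = 2.0 if datum["pending"] else 1.0
    datum["score"] += weight * sum(inlink_scores)
    datum["pending"] = False  # side effect: marks the update as done
    return datum

inlinks = [1.0, 2.0, 3.0, 4.0]

# Intended contract: the filter runs once and sees every inlink.
once = update_score({"score": 0.0, "pending": True}, inlinks)

# With a combiner: the first run sees one spill, a later run sees the rest.
split = {"score": 0.0, "pending": True}
update_score(split, inlinks[:2])   # flag is set here, then cleared
update_score(split, inlinks[2:])   # flag has already been cleared

print(once["score"], split["score"])
```

The two scores differ because the second call no longer sees the flag, which is the failure mode the comment describes.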

[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12578961#action_12578961 ]

Andrzej Bialecki  commented on NUTCH-530:
-----------------------------------------

If there are no new arguments for/against, in the light of my last comment I'd like to close this issue as Won't Fix.

[jira] Commented: (NUTCH-530) Add a combiner to improve performance on updatedb

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12579285#action_12579285 ]

Emmanuel Joke commented on NUTCH-530:
-------------------------------------

OK

[jira] Closed: (NUTCH-530) Add a combiner to improve performance on updatedb

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-530.
-----------------------------------

    Resolution: Won't Fix
