[jira] Created: (NUTCH-524) Generate Problem with Single Node

Sebastian Nagel (Jira)
Generate Problem with Single Node
---------------------------------

                 Key: NUTCH-524
                 URL: https://issues.apache.org/jira/browse/NUTCH-524
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 0.9.0
         Environment: All
            Reporter: Daniel Clark
            Priority: Minor
             Fix For: 0.9.0


Nutch with Hadoop has problems when the URL list contains only a single node (host) and the cluster has two or more machines. I will provide a fix for this.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-524) Generate Problem with Single Node

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Clark updated NUTCH-524:
-------------------------------

    Attachment: nutch-0.9_PartitionUrlByHost.patch

This is the patch for this problem.

[jira] Commented: (NUTCH-524) Generate Problem with Single Node

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514874 ]

Doğacan Güney commented on NUTCH-524:
-------------------------------------

If you are fetching N URLs from a single host, then you should fetch all N URLs from a single machine, no matter how many machines you have. This is necessary for web politeness (your fetcher should keep at most one connection open to a server at any time).

PS: Your patch unnecessarily removes and re-adds the entire file even though it actually changes just a single line. In the future, please do not attach a patch that touches lines it doesn't change.
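
Editor's illustration: the politeness rule described in this comment amounts to choosing the fetch partition from the host alone, so that every URL of a host lands on the same machine. The following is a minimal sketch of that idea, not the code in Nutch's PartitionUrlByHost or in the attached patch; the class name HostPartitionSketch, the method partitionFor, and the example URLs are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical illustration of host-keyed partitioning; not Nutch's actual class. */
final class HostPartitionSketch {

    /** Returns a partition index in [0, numPartitions) that depends only on the URL's host. */
    static int partitionFor(String url, int numPartitions) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            host = null;
        }
        // Fall back to the whole URL string when no host can be parsed.
        String key = (host != null ? host : url).toLowerCase(Locale.ROOT);
        // Mask the sign bit so the index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int machines = 4;
        // Both URLs share a host, so both print the same partition index.
        System.out.println(partitionFor("http://example.com/a.html", machines));
        System.out.println(partitionFor("http://example.com/b/c.html", machines));
    }
}
```

With this kind of keying, a URL list that contains only one host always produces exactly one non-empty partition, which is the behaviour the issue reports on a multi-machine cluster.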

[jira] Commented: (NUTCH-524) Generate Problem with Single Node

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515041 ]

Ian Holsman commented on NUTCH-524:
-----------------------------------

Hi Dogacan.

We need this setting because we have a single host with millions of URLs/files, and it is impossible for a single machine to crawl it in an adequate amount of time.

In this case web politeness isn't an issue, as we also own the site in question, and we know it can handle the load.

We thought that other large sites might also run into this issue, so we made it into a config option.

[jira] Commented: (NUTCH-524) Generate Problem with Single Node

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515259 ]

Doğacan Güney commented on NUTCH-524:
-------------------------------------

Have you tried playing with the max.threads.per.host option instead? If you set it to a value greater than 1, a fetcher can open more than one connection to a machine. So, setting max.threads.per.host equal to the number of machines you have may provide the same effect.
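
Editor's illustration: the effect of a per-host thread limit can be pictured as a counter of open connections per host that a fetcher thread must pass before connecting. The sketch below only illustrates that idea and assumes nothing about how Nutch's fetcher implements it; the class PerHostLimiterSketch and its methods are hypothetical. The option is called max.threads.per.host in the comment above; in the stock Nutch 0.9 configuration the corresponding property appears to be fetcher.threads.per.host, so check nutch-default.xml for the exact name.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical per-host connection cap; not Nutch's actual fetcher code. */
final class PerHostLimiterSketch {
    private final int maxThreadsPerHost;
    private final ConcurrentHashMap<String, Semaphore> slots = new ConcurrentHashMap<>();

    PerHostLimiterSketch(int maxThreadsPerHost) {
        this.maxThreadsPerHost = maxThreadsPerHost;
    }

    /** Returns true if the calling thread may open a connection to the host right now. */
    boolean tryAcquire(String host) {
        return slots
            .computeIfAbsent(host, h -> new Semaphore(maxThreadsPerHost))
            .tryAcquire();
    }

    /** Frees one slot; call only after a successful tryAcquire for the same host. */
    void release(String host) {
        Semaphore s = slots.get(host);
        if (s != null) {
            s.release();
        }
    }
}
```

With a limit of 1 this reproduces the polite default of one connection per host. A larger limit lets the single machine that owns the host fetch from it with several threads in parallel, which is the alternative suggested here to spreading one host across machines.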

[jira] Commented: (NUTCH-524) Generate Problem with Single Node

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525419 ]

Doğacan Güney commented on NUTCH-524:
-------------------------------------

Hi Ian and Daniel,

Have you tried the max.threads.per.host option? Or are you still working on this one?

[jira] Closed: (NUTCH-524) Generate Problem with Single Node

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-524.
-----------------------------------

    Resolution: Won't Fix

[jira] Commented: (NUTCH-524) Generate Problem with Single Node

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633373#action_12633373 ]

Andrzej Bialecki  commented on NUTCH-524:
-----------------------------------------

Closing this issue because the requested change is not generally applicable.
