[jira] Created: (NUTCH-382) Fix for NUTCH-365 introduced a bug if generate.max.per.host.by.ip is enabled

[jira] Created: (NUTCH-382) Fix for NUTCH-365 introduced a bug if generate.max.per.host.by.ip is enabled

JIRA jira@apache.org
Fix for NUTCH-365 introduced a bug if generate.max.per.host.by.ip is enabled
----------------------------------------------------------------------------

                 Key: NUTCH-382
                 URL: http://issues.apache.org/jira/browse/NUTCH-382
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 0.9.0
            Reporter: Jim Kellerman


The fix for NUTCH-365 in org.apache.nutch.crawl.Generator.java (revision 449088) introduced a bug: if generate.max.per.host.by.ip is enabled, the generator logs a warning such as "WARN  crawl.Generator (Generator.java:reduce(181)) - Malformed URL: '38.99.15.82', skipping". The address in the message varies with the host IP.

This is because the hostname has already been converted to its IP address, and the code:

              host = normalizers.normalize(host, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);

will not normalize an IP address. To fix this problem, the code inserted in revision 449088 should be placed inside an else block so that it is not executed when generate.max.per.host.by.ip is enabled.
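
To illustrate the proposed control flow, here is a minimal, self-contained sketch. It is not the actual Generator code and not the attached patch: the class name, normalizeHost, and resolveToIp are hypothetical stand-ins (the real code calls normalizers.normalize and performs a DNS lookup), and the stand-in normalizer simply rejects a bare IPv4 address to mirror the behavior reported above.

```java
import java.net.MalformedURLException;
import java.util.Locale;

public class HostCountKeySketch {

    // Hypothetical stand-in for normalizers.normalize(host, SCOPE_GENERATE_HOST_COUNT).
    // Assumption: it rejects a bare IPv4 address, as described in the report.
    static String normalizeHost(String host) throws MalformedURLException {
        if (host.matches("\\d{1,3}(\\.\\d{1,3}){3}")) {
            throw new MalformedURLException("Malformed URL: '" + host + "'");
        }
        return host.toLowerCase(Locale.ROOT);
    }

    // Hypothetical stand-in for the DNS lookup so the sketch runs offline.
    static String resolveToIp(String host) {
        return "38.99.15.82";
    }

    // Shape of the bug: normalization runs even after the host became an IP.
    static String keyBeforeFix(String host, boolean byIp) {
        if (byIp) {
            host = resolveToIp(host);
        }
        try {
            return normalizeHost(host);
        } catch (MalformedURLException e) {
            return null; // the entry would be skipped with a warning
        }
    }

    // Shape of the proposed fix: normalization only in the else branch.
    static String keyAfterFix(String host, boolean byIp) {
        if (byIp) {
            return resolveToIp(host);
        } else {
            try {
                return normalizeHost(host);
            } catch (MalformedURLException e) {
                return null;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(keyBeforeFix("Example.COM", true));  // null (skipped)
        System.out.println(keyAfterFix("Example.COM", true));   // 38.99.15.82
        System.out.println(keyAfterFix("Example.COM", false));  // example.com
    }
}
```

With counting by IP enabled, the first variant drops every entry, while the second counts by the resolved address and leaves the host-name path unchanged.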

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] Updated: (NUTCH-382) Fix for NUTCH-365 introduced a bug if generate.max.per.host.by.ip is enabled

JIRA jira@apache.org
     [ http://issues.apache.org/jira/browse/NUTCH-382?page=all ]

Jim Kellerman updated NUTCH-382:
--------------------------------

    Attachment: patch.txt

Patch to fix this issue.



       

[jira] Closed: (NUTCH-382) Fix for NUTCH-365 introduced a bug if generate.max.per.host.by.ip is enabled

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  closed NUTCH-382.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-382) Fix for NUTCH-365 introduced a bug if generate.max.per.host.by.ip is enabled

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566117#action_12566117 ]

Andrzej Bialecki  commented on NUTCH-382:
-----------------------------------------

This has been fixed as a part of another commit.

