[jira] Created: (NUTCH-606) Refactoring of Generator, run all urls through checks

JIRA jira@apache.org
Refactoring of Generator, run all urls through checks
-----------------------------------------------------

                 Key: NUTCH-606
                 URL: https://issues.apache.org/jira/browse/NUTCH-606
             Project: Nutch
          Issue Type: Bug
          Components: generator
         Environment: all
            Reporter: Dennis Kubes
            Priority: Minor
             Fix For: 1.0.0


Refactor the generator to make sure all URLs run through checks such as host and protocol checks, and IP checks if necessary.  Currently the generator only does this if generate.max.per.host > 0, which by default is -1.  So by default all URLs are collected without checks.
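
The requested behaviour can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the Generator's actual code: the class `GeneratorCheckSketch`, its `accept` method, and the per-host counter map are hypothetical names, and plain URL parsing stands in for the host and protocol checks. The point is only that the validity check runs for every URL, while the per-host limit still applies only when the setting is positive.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; names and structure are assumptions, not Nutch code. */
public class GeneratorCheckSketch {
  // Stands in for the generate.max.per.host setting (-1 means "no limit").
  private final int maxPerHost;
  private final Map<String, Integer> hostCounts = new HashMap<>();

  public GeneratorCheckSketch(int maxPerHost) {
    this.maxPerHost = maxPerHost;
  }

  /** Returns true if the URL should be collected. */
  @SuppressWarnings("deprecation") // URL(String) is deprecated in recent JDKs
  public boolean accept(String urlString) {
    URL url;
    try {
      // This check runs for every URL, whatever maxPerHost is set to.
      url = new URL(urlString);
    } catch (MalformedURLException e) {
      return false;
    }
    // The per-host limit is only enforced when it is enabled.
    if (maxPerHost > 0) {
      String host = url.getHost().toLowerCase(Locale.ROOT);
      int seen = hostCounts.getOrDefault(host, 0);
      if (seen >= maxPerHost) {
        return false;
      }
      hostCounts.put(host, seen + 1);
    }
    return true;
  }
}
```

With the limit left at -1, a malformed string is still rejected; with a limit of 1, a second URL from the same host is skipped.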

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (NUTCH-606) Refactoring of Generator, run all urls through checks

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes reassigned NUTCH-606:
----------------------------------

    Assignee: Dennis Kubes

[jira] Updated: (NUTCH-606) Refactoring of Generator, run all urls through checks

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-606:
-------------------------------

    Attachment: NUTCH-606-1-20080208.patch

Refactors the generator and ensures the checks are run on all URLs that could be collected, and not just if generate.max.per.host is > 0 (i.e. not the default).

[jira] Updated: (NUTCH-606) Refactoring of Generator, run all urls through checks

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-606:
-------------------------------

    Attachment: NUTCH-606-2-20080208.patch

Adds some refactoring to close file readers before exiting if no urls have been fetched.

[jira] Commented: (NUTCH-606) Refactoring of Generator, run all urls through checks

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567250#action_12567250 ]

Andrzej Bialecki  commented on NUTCH-606:
-----------------------------------------

+1. A minor issue: I don't think URL.getHost() can return a null value - even for URLs with unspecified host name it returns an empty non-null String.

[jira] Updated: (NUTCH-606) Refactoring of Generator, run all urls through checks

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-606:
-------------------------------

    Attachment: NUTCH-606-3-20080208.patch

Added an empty check for hostnames

[jira] Commented: (NUTCH-606) Refactoring of Generator, run all urls through checks

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567288#action_12567288 ]

Andrzej Bialecki  commented on NUTCH-606:
-----------------------------------------

I'm sorry, I should have been clearer ... My point was that it's not necessary to check for null host names, because AFAIK URL.getHost() never returns null. On the other hand, there are legitimate situations when it can return an empty string, so this check that you added in patch v. 3 is in fact harmful. E.g. it would filter out all "file:///" urls.
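
A small sketch of the point above, for anyone who wants to check it locally. The class `HostCheckDemo` and its methods are hypothetical names made up for this illustration; only `java.net.URL` and `getHost()` come from the JDK. It shows the host of a "file:///" URL being an empty, non-null string, so a filter that rejects empty hosts would drop it.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical demo class; not part of the patch. */
public class HostCheckDemo {
  /** Returns URL.getHost() for the given string. */
  @SuppressWarnings("deprecation") // URL(String) is deprecated in recent JDKs
  public static String hostOf(String spec) throws MalformedURLException {
    return new URL(spec).getHost();
  }

  /** The kind of filter under discussion: reject URLs whose host is empty. */
  public static boolean passesEmptyHostCheck(String spec) throws MalformedURLException {
    return !hostOf(spec).isEmpty();
  }

  public static void main(String[] args) throws MalformedURLException {
    String[] specs = {"http://example.com/index.html", "file:///tmp/page.html"};
    for (String spec : specs) {
      System.out.println(spec + " -> host=\"" + hostOf(spec)
          + "\", passes empty-host check: " + passesEmptyHostCheck(spec));
    }
  }
}
```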

[jira] Updated: (NUTCH-606) Refactoring of Generator, run all urls through checks

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-606:
-------------------------------

    Attachment: NUTCH-606-4-20080209.patch

Yup, I did some simple tests: any null or empty URLs will be filtered out when the URL is created, and removing empty hosts would filter out root paths.  Removed the checks for null and empty hosts.

[jira] Commented: (NUTCH-606) Refactoring of Generator, run all urls through checks

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567346#action_12567346 ]

Andrzej Bialecki  commented on NUTCH-606:
-----------------------------------------

+1, looks great now.

[jira] Commented: (NUTCH-606) Refactoring of Generator, run all urls through checks

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567793#action_12567793 ]

Dennis Kubes commented on NUTCH-606:
------------------------------------

If nobody has any objections I will go ahead and commit this tonight.

[jira] Resolved: (NUTCH-606) Refactoring of Generator, run all urls through checks

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes resolved NUTCH-606.
--------------------------------

    Resolution: Fixed

Committed.

[jira] Commented: (NUTCH-606) Refactoring of Generator, run all urls through checks

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12568423#action_12568423 ]

Hudson commented on NUTCH-606:
------------------------------

Integrated in Nutch-trunk #360 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/360/])
