[jira] Created: (NUTCH-603) Add more default url normalizations

ASF GitHub Bot (Jira)
Add more default url normalizations
-----------------------------------

                 Key: NUTCH-603
                 URL: https://issues.apache.org/jira/browse/NUTCH-603
             Project: Nutch
          Issue Type: Improvement
         Environment: All
            Reporter: Dennis Kubes
            Assignee: Dennis Kubes
             Fix For: 1.0.0


By default the regex-urlnormalizers only remove PHPSESSID strings.  I propose adding more default url normalizers, including expressions for removing different types of session ids, removing default pages, removing interpage links, and cleaning up url strings.  The point of these expressions is to reduce the number of duplicate urls that are stored and scored in the crawl database and then fetched.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-603) Add more default url normalizations

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-603:
-------------------------------

    Attachment: NUTCH-603-1-20080205.patch

Added normalizations for removing different session ids, for changing default pages such as index.html to /, for removing #something interpage anchors, and for cleaning up urls, such as collapsing multiple ampersands and stripping ending ?, ., or & characters.  Unit tests were added to show the results of the expressions.  All current expressions were tuned for performance.
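
To make the categories above concrete, here is a minimal sketch in Python. It is only an illustration: the patterns, the list of session parameter names, and the `normalize` helper are assumptions made up for this example, and they are not the expressions from the attached patch or from Nutch's regex normalizer configuration.

```python
import re

# Hypothetical (pattern, replacement) rules, applied in order.
# They mirror the categories described in the comment above and are
# NOT the expressions shipped in the NUTCH-603 patch.
RULES = [
    # Drop session-id query parameters (the name list is an assumption),
    # keeping the ? or & delimiter that preceded the parameter.
    (re.compile(r"([?&])(?:phpsessid|jsessionid|sessionid|sid)=[^&#]*&?",
                re.IGNORECASE), r"\1"),
    # Drop "#something" interpage anchors.
    (re.compile(r"#.*$"), ""),
    # Collapse runs of ampersands into one.
    (re.compile(r"&{2,}"), "&"),
    # Strip trailing ?, & or . characters.
    (re.compile(r"[?&.]+$"), ""),
]


def normalize(url):
    """Apply every rule in order and return the rewritten url."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url


print(normalize("http://example.com/a.php?PHPSESSID=abc123&x=1"))
# http://example.com/a.php?x=1
print(normalize("http://example.com/a.php?x=1&sid=42#top"))
# http://example.com/a.php?x=1
print(normalize("http://example.com/q?a=1&&&b=2&"))
# http://example.com/q?a=1&b=2
```

The first two urls end up identical after normalization, which is the kind of duplicate the issue wants to keep out of the crawl database.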

[jira] Commented: (NUTCH-603) Add more default url normalizations

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567797#action_12567797 ]

Dennis Kubes commented on NUTCH-603:
------------------------------------

If nobody has any objections I will go ahead and commit this tonight.

[jira] Commented: (NUTCH-603) Add more default url normalizations

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567860#action_12567860 ]

Andrzej Bialecki  commented on NUTCH-603:
-----------------------------------------

I'm of a split mind towards one of these rules: the one that strips /index.html and similar. I know of a few sites where /index.html != /index.php, I even remember creating one like that :) Some sites redirect / not to /index.html but somewhere down in the hierarchy, and they don't have any proper /index.html at all. In other words, I vote for removing this rule, or at least commenting it out.

Other than that, +1.
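
The concern about the default-page rule can be sketched in a few lines of Python. The pattern and the `strip_default_page` helper below are hypothetical and are not taken from the patch; they only show how a rule of this kind folds distinct urls into one.

```python
import re

# Hypothetical default-page rule (not the expression from the patch):
# rewrite a trailing /index.<ext> or /default.<ext> to "/".
DEFAULT_PAGE = re.compile(r"/(?:index|default)\.(?:html?|php|asp|jsp)$",
                          re.IGNORECASE)


def strip_default_page(url):
    """Replace a trailing default page name with a bare slash."""
    return DEFAULT_PAGE.sub("/", url)


a = strip_default_page("http://example.com/index.html")
b = strip_default_page("http://example.com/index.php")
print(a)       # http://example.com/
print(a == b)  # True
```

On a site where /index.html and /index.php serve different content, both collapse to the same url, so one of the two pages is never fetched under its own address.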

[jira] Commented: (NUTCH-603) Add more default url normalizations

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567958#action_12567958 ]

Dennis Kubes commented on NUTCH-603:
------------------------------------

I am ok with commenting it out.  As long as it is there for people to use (instead of having to create) I think it will be ok.  I will comment that out and if no objections will commit that version.

[jira] Updated: (NUTCH-603) Add more default url normalizations

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-603:
-------------------------------

    Attachment: NUTCH-603-2-20080212.patch

This patch comments out the default page removal (e.g. index.html) and adds the _ character to be removed if attached to session ids (e.g. _sessionid).
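
One reading of the _ change, sketched in Python: an underscore directly in front of the session parameter name is removed together with the parameter. The pattern, the parameter names, and the `strip_session_id` helper are assumptions for illustration only, and the actual expression in the patch may differ.

```python
import re

# Hypothetical pattern: an optional leading "_" is treated as part of the
# session-id parameter and removed with it. The parameter names are
# illustrative, not copied from the patch.
SESSION_ID = re.compile(
    r"([?&])_?(?:phpsessid|jsessionid|sessionid|sid)=[^&#]*&?",
    re.IGNORECASE)


def strip_session_id(url):
    """Remove session-id parameters, then any dangling ? or & at the end."""
    url = SESSION_ID.sub(r"\1", url)
    return re.sub(r"[?&]+$", "", url)


print(strip_session_id("http://example.com/p?_sessionid=abc&x=1"))
# http://example.com/p?x=1
print(strip_session_id("http://example.com/p?x=1&_sessionid=abc"))
# http://example.com/p?x=1
print(strip_session_id("http://example.com/p?sessionid=abc"))
# http://example.com/p
```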

[jira] Resolved: (NUTCH-603) Add more default url normalizations

ASF GitHub Bot (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes resolved NUTCH-603.
--------------------------------

    Resolution: Fixed

Committed.

[jira] Commented: (NUTCH-603) Add more default url normalizations

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12569175#action_12569175 ]

Hudson commented on NUTCH-603:
------------------------------

Integrated in Nutch-trunk #362 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/362/])
