[jira] Created: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat


[jira] Created: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

JIRA jira@apache.org
Move URLNormalizer from Outlink to ParseOutputFormat
----------------------------------------------------

                 Key: NUTCH-548
                 URL: https://issues.apache.org/jira/browse/NUTCH-548
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher
            Reporter: Emmanuel Joke
            Assignee: Emmanuel Joke
            Priority: Minor
             Fix For: 1.0.0
         Attachments: NUTCH-548.patch

The idea is to avoid instantiating a new URLNormalizer for every Outlink.
So I moved this operation to the ParseOutputFormat object.
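
A minimal sketch of the idea, not the actual patch: the link object becomes a plain value holder, and the writer that emits links owns a single normalizer and reuses it. All class names here (CountingNormalizer, Link, LinkWriter) are hypothetical stand-ins, and the trim-only rule is a toy placeholder for real normalization.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a URL normalizer that is costly to construct. */
class CountingNormalizer {
    static int instances = 0; // counts constructions, for illustration only

    CountingNormalizer() {
        instances++;
    }

    /** Toy rule: trims surrounding whitespace. Real normalizers do far more. */
    String normalize(String url) {
        return url.trim();
    }
}

/** Plain value object: the link no longer builds a normalizer in its constructor. */
class Link {
    final String toUrl;

    Link(String toUrl) {
        this.toUrl = toUrl;
    }
}

/** The writer creates one normalizer and applies it to every link it emits. */
class LinkWriter {
    private final CountingNormalizer normalizer = new CountingNormalizer();

    List<String> write(List<Link> links) {
        List<String> out = new ArrayList<>();
        for (Link link : links) {
            out.add(normalizer.normalize(link.toUrl));
        }
        return out;
    }
}
```

With this layout the number of normalizer instances depends on the number of writers, not on the number of links.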

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Emmanuel Joke updated NUTCH-548:
--------------------------------

    Attachment: NUTCH-548.patch

Patch provided


[jira] Commented: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524669 ]

Emmanuel Joke commented on NUTCH-548:
-------------------------------------

Actually, I have one comment/question. I noticed that we normalize and filter every link in ParseOutputFormat and then do it again in CrawlDbFilter during the updatedb step. Is it really necessary to do it twice, or could we also remove this duplicate operation?




[jira] Commented: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524691 ]

Doğacan Güney commented on NUTCH-548:
-------------------------------------

We don't do it in CrawlDbFilter unless the user specifically asks for it (by passing the "-normalize" option). Also, CrawlDbFilter's normalization scope is different from ParseOutputFormat's.


[jira] Commented: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12524761 ]

Emmanuel Joke commented on NUTCH-548:
-------------------------------------

Maybe I missed something, but it seems we do it.

CrawlDb.update defines a JobConf that uses CrawlDbFilter as the mapper class and sets the URL normalizer and filter. The urlnormalizer and filter flags are passed through the configuration (I suppose that happens when we set the plugin).

Actually, I found this out while testing/debugging this patch; you can see it by running a simple crawl in debug mode in Eclipse with a breakpoint set on RegexURLNormalizer.regexNormalize.

You raise an interesting point. Why do we have a scope? I checked the code and it seems we never really use the scope passed to the function. Am I wrong?

Besides, if you look at the following code in regexNormalize:
    List curRules = (List)scopedRules.get(scope);
    if (curRules == null) {
      ......
      if (curRules == EMPTY_RULES || curRules == null) {
        LOG.warn("can't find rules for scope '" + scope + "', using default");
        scopedRules.put(scope, EMPTY_RULES);
      }
    }
    if (curRules == EMPTY_RULES || curRules == null) {
      // use global rules
      curRules = (List)scopedRules.get(URLNormalizers.SCOPE_DEFAULT);
    }
Why don't we directly store the rules defined for SCOPE_DEFAULT under every scope that has no rules of its own, instead of storing EMPTY_RULES and then fetching the default rules on each call?
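
The alternative asked about above could look roughly like the following sketch. It is not the urlnormalizer-regex code: ScopedRuleCache, its methods, the "default" scope name, and the use of plain strings as rules are all assumptions made for illustration. An unknown scope is bound to the default rule list on first lookup, so later lookups need no sentinel check.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical cache of rule lists keyed by scope name. */
class ScopedRuleCache {
    static final String SCOPE_DEFAULT = "default"; // assumed name of the default scope

    private final Map<String, List<String>> scopedRules = new HashMap<>();

    ScopedRuleCache(List<String> defaultRules) {
        scopedRules.put(SCOPE_DEFAULT, defaultRules);
    }

    /** Registers rules for one scope (stands in for loading a per-scope rule file). */
    void register(String scope, List<String> rules) {
        scopedRules.put(scope, rules);
    }

    /** Returns the rules for a scope; a scope without rules is bound to the default list once. */
    List<String> rulesFor(String scope) {
        List<String> rules = scopedRules.get(scope);
        if (rules == null) {
            // Store the default list itself instead of an empty sentinel.
            rules = scopedRules.get(SCOPE_DEFAULT);
            scopedRules.put(scope, rules);
        }
        return rules;
    }

    /** True if a lookup for this scope has already been resolved and stored. */
    boolean isCached(String scope) {
        return scopedRules.containsKey(scope);
    }
}
```

Because the fallback stores the same list object as the default scope, the miss is still handled only once per scope, which is what the EMPTY_RULES sentinel appears to be for.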


[jira] Commented: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525124 ]

Doğacan Güney commented on NUTCH-548:
-------------------------------------

> Maybe I missed something, but it seems we do it.
>
> CrawlDb.update defines a JobConf that uses CrawlDbFilter as the mapper class and sets the URL normalizer and filter. The urlnormalizer
> and filter flags are passed through the configuration (I suppose that happens when we set the plugin).
>
> Actually, I found this out while testing/debugging this patch; you can see it by running a simple crawl in debug mode in Eclipse
> with a breakpoint set on RegexURLNormalizer.regexNormalize.

OK, so I added this simple patch ( http://www.ceng.metu.edu.tr/~e1345172/print.patch ), and updatedb doesn't print anything unless I pass "-filter" or "-normalize" on the command line. So I don't think we do it unless the user asks for it.
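
The behavior described here, where normalization and filtering run only when explicitly requested, can be sketched as below. This is not CrawlDbFilter: GatedUrlStep and the two configuration key names are hypothetical, and a plain Map stands in for the job configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical update step that normalizes and filters URLs only when asked to. */
class GatedUrlStep {
    // Hypothetical configuration keys; the real key names may differ.
    static final String NORMALIZE_KEY = "example.url.normalize";
    static final String FILTER_KEY = "example.url.filter";

    private final boolean normalize;
    private final boolean filter;
    private final UnaryOperator<String> normalizer;
    private final Predicate<String> accept;

    GatedUrlStep(Map<String, Boolean> conf,
                 UnaryOperator<String> normalizer,
                 Predicate<String> accept) {
        // Both flags default to false, so nothing happens unless requested.
        this.normalize = conf.getOrDefault(NORMALIZE_KEY, false);
        this.filter = conf.getOrDefault(FILTER_KEY, false);
        this.normalizer = normalizer;
        this.accept = accept;
    }

    List<String> process(List<String> urls) {
        List<String> out = new ArrayList<>();
        for (String url : urls) {
            String u = normalize ? normalizer.apply(url) : url;
            if (filter && !accept.test(u)) {
                continue; // rejected by the filter
            }
            out.add(u);
        }
        return out;
    }
}
```

A driver that sets both flags itself would make the step look unconditional to anyone observing it from a debugger.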

> You raise an interesting point. Why do we have a scope? I checked the code and it seems we never really use the scope
> passed to the function. Am I wrong?

Scope is just an extra piece of information that plugins may use. A URL normalizer plugin may want to treat a URL differently during an invertlinks operation, an updatedb operation, or whatever. I think no plugin uses it right now, but it doesn't hurt to keep it and it is potentially useful (by the way, there is an ongoing issue to add scope to URL filters too).
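
As an illustration of how a plugin might use the scope, here is a minimal sketch. The interface, the class, the scope names, and both rules are hypothetical and are not taken from any existing plugin.

```java
/** Hypothetical plugin contract: the caller passes the scope it is running in. */
interface ScopedNormalizer {
    String normalize(String url, String scope);
}

/** Example plugin: drops the fragment in every scope, strips a trailing slash in one scope only. */
class ExampleScopedNormalizer implements ScopedNormalizer {
    static final String SCOPE_CRAWLDB = "crawldb"; // assumed scope name

    @Override
    public String normalize(String url, String scope) {
        // Rule applied regardless of scope: remove "#fragment".
        int hash = url.indexOf('#');
        String result = hash >= 0 ? url.substring(0, hash) : url;

        // Rule applied only in the (assumed) crawldb scope: remove one trailing slash.
        if (SCOPE_CRAWLDB.equals(scope) && result.endsWith("/")) {
            result = result.substring(0, result.length() - 1);
        }
        return result;
    }
}
```

A plugin that ignores the scope argument behaves identically in every operation, which matches the current situation described above.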

> Besides, if you look at the following code in regexNormalize: [...]

I haven't looked at the urlnormalizer-regex code in detail, so I am not sure about this, but at first glance the setting-EMPTY_RULES-getting-default-rules part seems unnecessary.


[jira] Commented: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12525452 ]

Emmanuel Joke commented on NUTCH-548:
-------------------------------------

My mistake, you're right. I was using the crawl command for my test, and I didn't notice that its code sets the URL filter and URL normalizer options to TRUE.

Anyway, the current patch is still valid and useful.

Thanks again for the explanations.


[jira] Updated: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Emmanuel Joke updated NUTCH-548:
--------------------------------

    Attachment: NUTCH-548.patch.v2

New patch that removes an unused parameter and fixes the plugin parser.

This improvement has been open for a while; I'm wondering whether somebody will commit it soon.

Thanks for your update.


[jira] Commented: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539133 ]

Doğacan Güney commented on NUTCH-548:
-------------------------------------

I think this is ready for commit, but I would like to get approval from other (older) committers, because normalization in Outlink's constructor may have a special purpose (as I mentioned before, it is possible that the Outlink class is meant to stand on its own).


[jira] Resolved: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-548.
---------------------------------

    Resolution: Fixed

Since no one has objected for a while, I am committing this one.

Committed in rev. 593186.


[jira] Closed: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-548.
-------------------------------



[jira] Commented: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541231 ]

Hudson commented on NUTCH-548:
------------------------------

Integrated in Nutch-Nightly #261 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/261/])


[jira] Commented: (NUTCH-548) Move URLNormalizer from Outlink to ParseOutputFormat

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/NUTCH-548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541509 ]

Hudson commented on NUTCH-548:
------------------------------

Integrated in Nutch-Nightly #262 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/262/])
