[jira] Created: (NUTCH-366) Merge URLFilters and URLNormalizers


JIRA jira@apache.org
Merge URLFilters and URLNormalizers
-----------------------------------

                 Key: NUTCH-366
                 URL: http://issues.apache.org/jira/browse/NUTCH-366
             Project: Nutch
          Issue Type: Improvement
            Reporter: Andrzej Bialecki


Currently Nutch uses two subsystems related to URL validation and normalization:

* URLFilter: this interface checks if URLs are valid for further processing. Input URL is not changed in any way. The output is a boolean value.

* URLNormalizer: this interface brings URLs to their base ("normal") form, or removes unneeded URL components, or performs any other URL mangling as necessary. Input URLs are changed, and are returned as result.

However, various Nutch tools run filters and normalizers in a pre-determined order, i.e. normalizers first, and then filters. In some cases, where normalizers are complex and running them is costly (e.g. numerous regex rules, DNS lookups), it would make sense to run some of the filters first (e.g. prefix-based filters that select only certain protocols, or suffix-based filters that select only known "extensions"). This is currently not possible: we always have to run the normalizers, only to later throw away URLs because they failed to pass through the filters.
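
The cost argument above can be illustrated with a minimal sketch. All names here are hypothetical stand-ins, not Nutch APIs: the "costly" normalizer only strips a fragment and counts its own invocations, and the filter is a simple protocol prefix check.

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical sketch; none of these names exist in Nutch.
class FilterFirstSketch {
    // Counts how often the (pretend) expensive normalizer actually runs.
    static int normalizerCalls = 0;

    // Cheap prefix-based filter: accept only http and https URLs.
    static final Predicate<String> PROTOCOL_FILTER =
        url -> url.startsWith("http://") || url.startsWith("https://");

    // Stand-in for a costly normalizer (regex rules, DNS lookups, ...).
    // Here it only strips the fragment part of the URL.
    static final UnaryOperator<String> COSTLY_NORMALIZER = url -> {
        normalizerCalls++;
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    };

    // Fixed order described above: normalize first, then filter.
    static String normalizeThenFilter(String url) {
        String normalized = COSTLY_NORMALIZER.apply(url);
        return PROTOCOL_FILTER.test(normalized) ? normalized : null;
    }

    // Alternative order: reject cheaply before paying for normalization.
    static String filterThenNormalize(String url) {
        if (!PROTOCOL_FILTER.test(url)) {
            return null;
        }
        return COSTLY_NORMALIZER.apply(url);
    }
}
```

For a rejected URL such as an ftp:// link, normalizeThenFilter still invokes the normalizer once, while filterThenNormalize never does.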

I would like to solicit comments on the following two solutions, and work on implementation of one of them:

1) we could make URLFilters and URLNormalizers implement the same interface, and basically make them interchangeable. This way users could configure their order arbitrarily, even mixing filters and normalizers out of order. This is more complicated, but gives much more flexibility - and NUTCH-365 already provides sufficient framework to implement this, including the ability to define different sequences for different steps in the workflow.

2) we could use a property "url.mangling.order" ;) to define whether normalizers or filters should run first. This is simple to implement, but provides only limited improvement - because either all filters or all normalizers would run, they couldn't be mixed in arbitrary order.
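
A rough sketch of what option 2 might look like. Only the property name "url.mangling.order" comes from the text above; the values "filters" and "normalizers", the default, and the use of java.util.Properties instead of the Nutch configuration classes are assumptions made for illustration.

```java
import java.util.List;
import java.util.Properties;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical sketch of option 2; not an existing Nutch class.
class ManglingOrderSketch {
    static String process(String url, Properties conf,
                          List<Predicate<String>> filters,
                          List<UnaryOperator<String>> normalizers) {
        // Assumed values: "filters" or "normalizers" (assumed default),
        // naming the group that runs first.
        boolean filtersFirst =
            "filters".equals(conf.getProperty("url.mangling.order", "normalizers"));

        if (filtersFirst && !passesAll(url, filters)) {
            return null; // rejected before any normalizer ran
        }
        for (UnaryOperator<String> normalizer : normalizers) {
            url = normalizer.apply(url);
        }
        if (!filtersFirst && !passesAll(url, filters)) {
            return null; // rejected after normalization
        }
        return url;
    }

    static boolean passesAll(String url, List<Predicate<String>> filters) {
        for (Predicate<String> filter : filters) {
            if (!filter.test(url)) {
                return false;
            }
        }
        return true;
    }
}
```

The two groups stay monolithic in this sketch: all filters run either before or after all normalizers, which is the limitation noted above. The order can also change the outcome, since a filter may accept a URL only after it has been normalized.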

Any comments?


       

Re: [jira] Created: (NUTCH-366) Merge URLFilters and URLNormalizers

Federico Dal Maso
>
>
> I would like to solicit comments on the following two solutions, and work
> on implementation of one of them:
>
> 1) we could make URLFilters and URLNormalizers implement the same
> interface, and basically make them interchangeable. This way users could
> configure their order arbitrarily, even mixing filters and normalizers out
> of order. This is more complicated, but gives much more flexibility - and
> NUTCH-365 already provides sufficient framework to implement this, including
> the ability to define different sequences for different steps in the
> workflow.
>
>

I suggest (and vote for) an extended version of solution 1.

I suggest creating a new type, maybe URLTransformation or URLAnalyzer,
defined as an interface:

public interface URLAnalyzer {
   // Returns the (possibly rewritten) URL, or null to reject it.
   public String analyze(String url);
}

URLFilter should be refactored this way (to preserve the semantics of the
"pass-or-not-pass" filter):

public abstract class URLFilter implements URLAnalyzer {
   public abstract boolean filter(String url);

   // Pass the URL through unchanged, or return null to reject it.
   public String analyze(String url) {
      if (filter(url))
         return url;
      else
         return null;
   }
}

URLNormalizer changes similarly:

public abstract class URLNormalizer implements URLAnalyzer {
   public abstract String normalize(String url);

   // Analysis of a URL is simply its normalization.
   public String analyze(String url) {
      return normalize(url);
   }
}


Pros:
- URLAnalyzers can be run transparently in any order by a "chain
algorithm" that ignores the dynamic type (filter, normalizer, or other)
of the URLAnalyzer instance it is currently handling. The chainer must treat a
null return from an analyzer as "stop the chain".
- The current URLFilter and URLNormalizer classes preserve their own semantics.
- New exotic URLAnalyzer implementations can be added easily (and in any
order).
- Groups of filter and normalizer executions can be implemented by a wrapper
class that implements URLAnalyzer. This could be used for a custom hard-coded
execution order or for a workflow-ordered execution.
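
The "chain algorithm" and the wrapper class from the list above could look roughly like the sketch below. URLAnalyzer follows the interface proposed in this message; URLAnalyzerChain, ChainDemo and the two lambda analyzers are hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

// The interface as proposed in this message.
interface URLAnalyzer {
    // Returns the (possibly rewritten) URL, or null to reject it.
    String analyze(String url);
}

// Hypothetical wrapper: runs a group of analyzers in the given order and is
// itself a URLAnalyzer, so groups can be nested inside other chains.
class URLAnalyzerChain implements URLAnalyzer {
    private final List<URLAnalyzer> steps;

    URLAnalyzerChain(URLAnalyzer... steps) {
        this.steps = Arrays.asList(steps);
    }

    @Override
    public String analyze(String url) {
        String current = url;
        for (URLAnalyzer step : steps) {
            if (current == null) {
                return null; // a previous step rejected the URL: stop the chain
            }
            // The chain does not care whether the step is a filter or a normalizer.
            current = step.analyze(current);
        }
        return current;
    }
}

class ChainDemo {
    public static void main(String[] args) {
        // Filter-like step: pass http URLs through, reject everything else.
        URLAnalyzer httpOnly = u -> u.startsWith("http://") ? u : null;
        // Normalizer-like step: strip the fragment.
        URLAnalyzer stripFragment = u -> {
            int hash = u.indexOf('#');
            return hash < 0 ? u : u.substring(0, hash);
        };
        URLAnalyzer chain = new URLAnalyzerChain(httpOnly, stripFragment);
        System.out.println(chain.analyze("http://example.org/a#top"));
        System.out.println(chain.analyze("ftp://example.org/a"));
    }
}
```

Because the chain implements URLAnalyzer itself, a hard-coded group and a workflow-configured sequence can share the same code path.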

Cons:
- Current URLFilter and URLNormalizer implementations would have to migrate to
the new API: the "implements" keyword must be replaced with "extends", because
URLFilter and URLNormalizer become abstract classes.
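
As a hedged illustration of that migration, the sketch below uses the signatures proposed in this message (which may differ from the actual Nutch interfaces) and a hypothetical HttpPrefixFilter as the existing implementation being migrated.

```java
// Types as proposed in this message; not necessarily the Nutch signatures.
interface URLAnalyzer {
    String analyze(String url);
}

abstract class URLFilter implements URLAnalyzer {
    public abstract boolean filter(String url);

    @Override
    public String analyze(String url) {
        return filter(url) ? url : null;
    }
}

// Hypothetical existing filter.
// Before the change it would read: class HttpPrefixFilter implements URLFilter
// After the change only the keyword differs; filter() itself is untouched.
class HttpPrefixFilter extends URLFilter {
    @Override
    public boolean filter(String url) {
        return url.startsWith("http://");
    }
}
```

The filter body stays the same, and analyze() comes for free from the abstract base class.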

--
Fede_