Questions on normalizer and filter related code in Crawl, Injector and Generator


Questions on normalizer and filter related code in Crawl, Injector and Generator

Susam Pal
I found a few things in the org.apache.nutch.crawl package that I want
to ask about. I have three questions.

(1) In Injector.java, normalize() happens first and then filter()
happens, whereas in Generator.java filter() happens in the map phase and
normalize() happens in the reduce phase. Why is the order different in
the two?

Injector.java (Lines: 74 - 77)

      try {
        url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
        url = filters.filter(url);             // filter the url
      } catch (Exception e) {

Generator.java (Lines: 130 - 131)

          if (filters.filter(url.toString()) == null)
            return;

Generator.java (Line: 218)

urlString = normalizers.normalize(urlString,
URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
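To see why the ordering can matter, here is a toy, Nutch-free sketch: a hypothetical "normalizer" strips a default ':80' port and a hypothetical "filter" rejects URLs that carry an explicit port, so the two orderings disagree on the same input. All names below are illustrative stand-ins, not the real URLNormalizers/URLFilters plugins.

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

public class OrderDemo {
    // stand-in normalizer: drop a default :80 port
    static final UnaryOperator<String> normalize = u -> u.replace(":80/", "/");
    // stand-in filter: reject URLs with an explicit port
    static final Predicate<String> accept = u -> !u.matches(".*:\\d+/.*");

    static String normalizeThenFilter(String url) {   // Injector's order
        String n = normalize.apply(url);
        return accept.test(n) ? n : null;
    }

    static String filterThenNormalize(String url) {   // Generator's order
        if (!accept.test(url)) return null;           // rejected before cleanup
        return normalize.apply(url);
    }

    public static void main(String[] args) {
        String url = "http://example.com:80/index.html";
        System.out.println(normalizeThenFilter(url));  // survives: port stripped first
        System.out.println(filterThenNormalize(url));  // null: filter sees the port
    }
}
```

With these particular rules, normalize-then-filter keeps the URL while filter-then-normalize drops it, which is why the inconsistency between Injector and Generator is worth questioning.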

(2) In Generator.java, the normalizers.normalize() statement is inside
the following 'if' block.

Generator.java (Line: 186)

        if (maxPerHost > 0) {

I am curious to know why we should skip URL normalization when
generate.max.per.host = -1 (which also happens to be the default value
in 'conf/nutch-default.xml').

(3) In Generator.java, the filters.filter() statement is inside the
following 'if' block. So, if filter is false, URL filtering won't be
done.

Generator.java (Lines: 127 - 132)

      if (filter) {
        // If filtering is on don't generate URLs that don't pass URLFilters
        try {
          if (filters.filter(url.toString()) == null)
            return;

In the generate() method, the following line sets the filter value passed to it.

Generator.java (Line: 407)

job.setBoolean(CRAWL_GENERATE_FILTER, filter);

Now, if we look at the Crawl.java code, we'll find that it always
sets the filter to false. The sixth argument to the generate() method
is the filter value.

Crawl.java (Lines: 117 - 119)

    for (i = 0; i < depth; i++) {             // generate new segment
      Path segment = generator.generate(crawlDb, segments, -1, topN, System
          .currentTimeMillis(), false, false);

So, this switches filtering off for any crawl done using the
'bin/nutch crawl' tool. Am I correct? Should I open a JIRA issue for
this and submit a one-line fix?

Regards,
Susam Pal

Re: Questions on normalizer and filter related code in Crawl, Injector and Generator

Dennis Kubes-2


Susam Pal wrote:

> I found a few of things in org.apache.nutch.crawl package which I want
> to ask. I have three questions.
>
> (1) In Injector.java, normalize() happens first and then filter()
> happens, where as in Generator.java filter() happens in map phase and
> normalize() happens in reduce phase. Why is the order different in
> both?
>
> Injector.java (Lines: 74 - 77)
>
>       try {
>         url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
>         url = filters.filter(url);             // filter the url
>       } catch (Exception e) {
>
> Generator.java (Lines: 130 - 131)
>
>           if (filters.filter(url.toString()) == null)
>             return;
>
> Generator.java (Line: 218)
>
> urlString = normalizers.normalize(urlString,
> URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
>
> (2) In Generator.java, the normalizers.normalize() statement is inside
> the following 'if' block.
>
> Generator.java (Line: 186)
>
>         if (maxPerHost > 0) {
>
> I am curious to know why we should avoid URL normalization if
> generate.max.per.host = -1 (which also happens to be the default value
> in 'conf/nutch-default.xml'?)

Yeah, that looks like a bug.  IMO, normalization should be outside of
that block.  That would need a JIRA and a patch. :)

>
> (3) In Generator.java, the filters.filter() statement is inside the
> following 'if' block. So, if filter is false, URL filtering won't be
> done.
>
> Generator.java (Lines: 127 - 132)
>
>       if (filter) {
>         // If filtering is on don't generate URLs that don't pass URLFilters
>         try {
>           if (filters.filter(url.toString()) == null)
>             return;
>
> In generate() method, the following code sets the filter value passed to it.
>
> Generate.java (Line: 407)
>
> job.setBoolean(CRAWL_GENERATE_FILTER, filter);
>
> Now, if we see the Crawl.java code, we'll find that it is always
> setting the filter to false. The 6th argument to the generate() method
> is the filter value.
>
> Crawl.java (Lines: 117 - 119)
>
>     for (i = 0; i < depth; i++) {             // generate new segment
>       Path segment = generator.generate(crawlDb, segments, -1, topN, System
>           .currentTimeMillis(), false, false);
>
> So, this is switching the filter off for any crawl done using,
> 'bin/nutch crawl' tool. Am I correct? Should I open a JIRA issue for
> this and submit a one line fix?

For the Generator alone, filtering is on by default; in the main method:

     boolean filter = true;
...
       } else if ("-noFilter".equals(args[i])) {
         filter = false;
...
       Path seg = generate(dbDir, segmentsDir, numFetchers, topN,
curTime, filter, force);

then in the generate method:

job.setBoolean(CRAWL_GENERATE_FILTER, filter);

filter is taken from the input if it is not found in the
"crawl.generate.filter" configuration variable.

But for the crawl command, it looks like filtering is indeed off by
default.  Should it be enabled?  I would like to get community input on
that.  If the config variable is set, wouldn't it override anyway?
It seems to me that the only reason to change it in crawl right now
would be to keep it consistent with the default behavior of the Generator.

Dennis


>
> Regards,
> Susam Pal

Re: Questions on normalizer and filter related code in Crawl, Injector and Generator

Susam Pal
I am adding a few more observations.

On Feb 6, 2008 1:47 AM, Dennis Kubes <[hidden email]> wrote:

> For the generator alone filtering is on by default, in main method:
>
>      boolean filter = true;
> ...
>        } else if ("-noFilter".equals(args[i])) {
>          filter = false;
> ...
>        Path seg = generate(dbDir, segmentsDir, numFetchers, topN,
> curTime, filter, force);
>
> then in the generate method:
>
> job.setBoolean(CRAWL_GENERATE_FILTER, filter);
>
> filter is taken from input if not found in the "crawl.generate.filter"
> configuration variable.

The boolean value for filter is always false in a crawl using
'bin/nutch crawl', irrespective of whether 'crawl.generate.filter' is
found in the configuration files or not. The reason is that the code
unconditionally does job.setBoolean(CRAWL_GENERATE_FILTER, filter),
and the value of filter passed by Crawl.java is false.

>
> But for crawl command, looks like yes filtering is off by default.
> Should it be enabled?  I guess I would like to get community input on
> that.  If the config variable is set then it would override anyways?

In the current code the configuration won't override it, as I have
explained above. Deleting the job.setBoolean(CRAWL_GENERATE_FILTER, filter);
line in the generate() method would fix the problem.

> Seems to me like the only reason to change it currently in crawl would
> just to be to keep it consistent with the default behavior of generator.
>
> Dennis
>

I feel the same too. In fact, the last two values that Crawl.java
passes as false are not required at all.

      Path segment = generator.generate(crawlDb, segments, -1, topN, System
          .currentTimeMillis(), false, false);

What is the point in unconditionally passing false? We can always do:

filter = job.getBoolean(CRAWL_GENERATE_FILTER, false);

inside Generator.

This is currently being done in public void configure(JobConf job), but
there the default value passed is true, which is consistent with the
default behavior of Generator. It would be much better to expose an
overloaded generate() method that takes only the five arguments that
Crawl needs to set.
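A minimal sketch of the overload being proposed, with a plain Map standing in for Hadoop's JobConf; the names and signatures here are illustrative, not actual Nutch code. The five-argument form reads the filter flag from configuration (defaulting to true, matching the Generator's default), while the longer form still lets the generate command line force a value:

```java
import java.util.HashMap;
import java.util.Map;

public class GeneratorSketch {
    static final String CRAWL_GENERATE_FILTER = "crawl.generate.filter";
    final Map<String, String> conf = new HashMap<>();  // stands in for JobConf

    // 5-argument form: filter comes from configuration, defaulting to true
    boolean generate(String crawlDb, String segments, int numLists,
                     long topN, long curTime) {
        boolean filter = Boolean.parseBoolean(
            conf.getOrDefault(CRAWL_GENERATE_FILTER, "true"));
        return generate(crawlDb, segments, numLists, topN, curTime, filter, false);
    }

    // 7-argument form: the caller (e.g. the generate CLI) decides explicitly
    boolean generate(String crawlDb, String segments, int numLists,
                     long topN, long curTime, boolean filter, boolean force) {
        return filter;  // would run the job; here we just report the flag
    }
}
```

With something like this, Crawl.java would call the short form and the 'crawl.generate.filter' setting would actually be honored instead of being clobbered by a hard-coded false.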

Regards,
Susam Pal

Re: Questions on normalizer and filter related code in Crawl, Injector and Generator

Dennis Kubes-2


Susam Pal wrote:

> I am adding a few more observations.
>
> On Feb 6, 2008 1:47 AM, Dennis Kubes <[hidden email]> wrote:
>> For the generator alone filtering is on by default, in main method:
>>
>>      boolean filter = true;
>> ...
>>        } else if ("-noFilter".equals(args[i])) {
>>          filter = false;
>> ...
>>        Path seg = generate(dbDir, segmentsDir, numFetchers, topN,
>> curTime, filter, force);
>>
>> then in the generate method:
>>
>> job.setBoolean(CRAWL_GENERATE_FILTER, filter);
>>
>> filter is taken from input if not found in the "crawl.generate.filter"
>> configuration variable.
>
> The boolean value for filter is always false in a crawl using
> 'bin/nutch crawl' irrespective of whether 'crawl.generate.filter' is
> found in configuration files or not. The reason is the code is
> unconditionally doing job.setBoolean(CRAWL_GENERATE_FILTER, filter)
> and the value of filter passed by Crawl.java is false.
>
>> But for crawl command, looks like yes filtering is off by default.
>> Should it be enabled?  I guess I would like to get community input on
>> that.  If the config variable is set then it would override anyways?
>
> In current code the configuration won't override as I have explained
> above. If we delete the job.setBoolean(CRAWL_GENERATE_FILTER, filter);
> line in generate() method, it would fix the problem.
>
>> Seems to me like the only reason to change it currently in crawl would
>> just to be to keep it consistent with the default behavior of generator.
>>
>> Dennis
>>
>
> I feel the same too. In fact the last two values that Crawl.java sends
> as false is not required at all.
>
>       Path segment = generator.generate(crawlDb, segments, -1, topN, System
>           .currentTimeMillis(), false, false);
>
> What is the point in unconditionally passing false? We can always do:-
>
> filter = job.getBoolean(CRAWL_GENERATE_FILTER, false);
>
> inside Generator.

+1, I completely agree.  Do you want to work up this patch?

Dennis

>
> This is currently being done in public void configure(JobConf job) but
> the default value passed is true which is consistent with the default
> behavior of Generator. It is much better to expose an overloaded
> generate() method which takes only the 5 arguments that Crawl needs to
> set.
>
> Regards,
> Susam Pal

Re: Questions on normalizer and filter related code in Crawl, Injector and Generator

Susam Pal
In reply to this post by Dennis Kubes-2
> Susam Pal wrote:
> > (2) In Generator.java, the normalizers.normalize() statement is inside
> > the following 'if' block.
> >
> > Generator.java (Line: 186)
> >
> >         if (maxPerHost > 0) {
> >
> > I am curious to know why we should avoid URL normalization if
> > generate.max.per.host = -1 (which also happens to be the default value
> > in 'conf/nutch-default.xml'?)
>
> Yeah, that looks like a bug.  IMO, normalization should be outside of
> that block.  That would need a JIRA and a patch. :)

I investigated the URLNormalizers code a bit more and I feel the
normalization there may not be required. In URLNormalizers
<http://lucene.apache.org/nutch/apidocs/org/apache/nutch/net/URLNormalizers.html#field_summary>
we can see that there is no scope for generate alone (no
SCOPE_GENERATE). I performed a few crawls with custom logs to check
the phases at which the normalize() method of URLNormalizers is called
and what scope is passed to it. I found that after all the URLs are
fetched by the fetcher, when the outlinks are generated, normalize()
is run on every outlink with SCOPE_OUTLINK. So, I feel we need not run
normalize again at the beginning of the Generator at the next depth.

Even when the Generator is called just after the Injector, the Injector
would have done a normalize() with SCOPE_INJECT. By default all the
scopes use the same 'urlnormalizer.regex.file' property, since we do
not have scope-specific regex-normalize.xml files.

Regards,
Susam Pal

Re: Questions on normalizer and filter related code in Crawl, Injector and Generator

Susam Pal
In reply to this post by Dennis Kubes-2
Yes, this should be a simple patch. I will upload one tomorrow.

Regards,
Susam Pal

On Feb 7, 2008 12:11 AM, Dennis Kubes <[hidden email]> wrote:

> +1, I completely agree.  Do you want to work up this patch?
>
> Dennis

Re: Questions on normalizer and filter related code in Crawl, Injector and Generator

Dennis Kubes-2
In reply to this post by Susam Pal
Funny, I was working on something similar and came to about the same
conclusions.  While we could have a normalizer in generate, it seems to
me it causes more problems (duplicates) than it is worth.
Currently the Generator's normalization doesn't work properly anyway,
because we never update the collected URL to the normalized URL.  Still, I
would like to get community thoughts on this, as we may be missing something.
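A toy sketch (hypothetical names, not Nutch code) of the duplicate problem being described: if the normalized form is only computed for counting and the collected URL is never replaced by it, two spellings of the same page both land in the fetch list.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

public class NormalizeEmitDemo {
    // stand-in normalizer: strips a default :80 port
    static String normalize(String url) {
        return url.replace(":80/", "/");
    }

    // build a fetch list; 'emitNormalized' controls whether the collected
    // URL is replaced with its normalized form before being emitted
    static List<String> generate(List<String> urls, boolean emitNormalized) {
        LinkedHashSet<String> fetchList = new LinkedHashSet<>();
        for (String url : urls) {
            String norm = normalize(url);            // computed either way
            fetchList.add(emitNormalized ? norm : url);
        }
        return new ArrayList<>(fetchList);
    }
}
```

When the original URLs are emitted, 'http://example.com:80/' and 'http://example.com/' survive as two fetch-list entries even though they name the same page; emitting the normalized form collapses them to one.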

Dennis

Susam Pal wrote:

> I investigated the URLNormalizers code a bit more and I feel it may
> not be required. In URLNormalizers
> <http://lucene.apache.org/nutch/apidocs/org/apache/nutch/net/URLNormalizers.html#field_summary>
> we can see that there is no scope for generate alone (no
> SCOPE_GENERATE). I performed a few crawls with custom logs to check
> the phases at which the normalize() method of URLNormalizers is called
> and what scope is passed to it. I find that after all the URLs are
> fetched by the fetcher, when the outlinks are generated, the
> normalize() is run on every outlink with SCOPE_OUTLINK. So, I feel we
> need not run normalize again at the beginning of Generator at the next
> depth.
>
> Even when Generator is called just after Injector, the Injector would
> have done a normalize() with SCOPE_INJECT. By default all the scopes
> use the same  'urlnormalizer.regex.file' property since we do not have
> scope specific regex-normalize.xml file.
>
> Regards,
> Susam Pal