Found the bug in Generator when number of URLs is small

Vishal Shah-3
Hi,

I think I found the reason why the generator returns an empty fetchlist
for small fetch sizes.

   After the first job finishes running, the generator checks the following
condition to see if it got an empty list:

    if (readers == null || readers.length == 0
        || !readers[0].next(new FloatWritable())) {

The third condition is incorrect. In some cases, especially for small
fetchlists, the first partition might be empty while other partitions
contain URLs. The Generator then incorrectly assumes that all partitions
are empty after looking only at the first one. This problem can also occur
when all URLs in the fetchlist are from the same host (or from a very small
number of hosts, or from a number of hosts that all map to a small number
of partitions).
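
To illustrate how the first partition can end up empty, here is a minimal sketch. It assumes URLs are assigned to partitions by hashing the host name, which is how I understand the behaviour described above; the class and method names (PartitionSketch, partitionFor, partition) are hypothetical and are not taken from the Nutch source.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for host-based partitioning; not Nutch code.
class PartitionSketch {

    // Assumption: the partition is the non-negative host hash modulo the partition count.
    static int partitionFor(String url, int numPartitions) {
        String host = URI.create(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Distribute the URLs into numPartitions buckets.
    static List<List<String>> partition(List<String> urls, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String url : urls) {
            parts.get(partitionFor(url, numPartitions)).add(url);
        }
        return parts;
    }

    public static void main(String[] args) {
        // All URLs share one host, so they all land in the same partition.
        List<String> urls = List.of(
            "http://example.org/a",
            "http://example.org/b",
            "http://example.org/c");
        List<List<String>> parts = partition(urls, 4);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("partition " + i + ": " + parts.get(i).size() + " URLs");
        }
    }
}
```

With a single host and four partitions, three partitions are always empty, so whether partition 0 has any records depends only on the hash value of that one host.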
 
  I fixed this problem by replacing the following code:

    // check that we selected at least some entries ...
    SequenceFile.Reader[] readers =
        SequenceFileOutputFormat.getReaders(job, tempDir);
    if (readers == null || readers.length == 0
        || !readers[0].next(new FloatWritable())) {
      LOG.warn("Generator: 0 records selected for fetching, exiting ...");
      LockUtil.removeLockFile(fs, lock);
      fs.delete(tempDir);
      return null;
    }

With the following code:

    // check that we selected at least some entries ...
    SequenceFile.Reader[] readers =
        SequenceFileOutputFormat.getReaders(job, tempDir);
    boolean empty = true;
    if (readers != null && readers.length > 0) {
      for (int num = 0; num < readers.length; num++) {
        if (readers[num].next(new FloatWritable())) {
          empty = false;
          break;
        }
      }
    }
    if (empty) {
      LOG.warn("Generator: 0 records selected for fetching, exiting ...");
      LockUtil.removeLockFile(fs, lock);
      fs.delete(tempDir);
      return null;
    }

This seems to do the trick.
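
For anyone who wants to see the difference between the two checks in isolation, here is a small self-contained sketch. It is only a model: each partition is represented by a plain list of URLs instead of a SequenceFile.Reader, and the names EmptyCheckSketch, firstPartitionEmpty and allPartitionsEmpty are hypothetical.

```java
import java.util.List;

// Simplified model of the emptiness check; lists stand in for partition readers.
class EmptyCheckSketch {

    // Old check: reports "empty" as soon as the first partition has no records.
    static boolean firstPartitionEmpty(List<List<String>> partitions) {
        return partitions == null || partitions.isEmpty() || partitions.get(0).isEmpty();
    }

    // New check: reports "empty" only if no partition has any record.
    static boolean allPartitionsEmpty(List<List<String>> partitions) {
        if (partitions == null) {
            return true;
        }
        for (List<String> partition : partitions) {
            if (!partition.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Partition 0 is empty, partition 1 holds one URL.
        List<List<String>> partitions = List.of(
            List.<String>of(),
            List.of("http://example.org/"));
        System.out.println(firstPartitionEmpty(partitions)); // prints "true"
        System.out.println(allPartitionsEmpty(partitions));  // prints "false"
    }
}
```

The old check wrongly reports an empty fetchlist for this input, while the new one does not.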
 
Regards,
 
-vishal.

Re: Found the bug in Generator when number of URLs is small

Doğacan Güney-3
On 6/21/07, Vishal Shah <[hidden email]> wrote:

> This seems to do the trick.

Nice catch. Can you open a JIRA issue and attach a patch there?


--
Doğacan Güney

RE: Found the bug in Generator when number of URLs is small

Vishal Shah-3
Hi Dogacan,

I've uploaded the patch to NUTCH-503.

http://issues.apache.org/jira/browse/NUTCH-503


Regards,

-vishal.

-----Original Message-----
From: Dogacan Güney [mailto:[hidden email]]
Sent: Thursday, June 21, 2007 12:33 PM
To: [hidden email]; [hidden email]
Subject: Re: Found the bug in Generator when number of URLs is small

On 6/21/07, Vishal Shah <[hidden email]> wrote:
> This seems to do the trick.

Nice catch. Can you open a JIRA issue and attach a patch there?
