[jira] Created: (NUTCH-503) Generator exits incorrectly for small fetchlists

[jira] Created: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
Generator exits incorrectly for small fetchlists
-------------------------------------------------

                 Key: NUTCH-503
                 URL: https://issues.apache.org/jira/browse/NUTCH-503
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 0.9.0, 0.8.1, 0.8
         Environment: Fedora Core 2, JDK 1.6
            Reporter: Vishal Shah
             Fix For: 0.8.2


   I think I found the reason why the generator returns an empty fetchlist for small fetch sizes.

   After the first job finishes running, the generator checks the following condition to see whether it got an empty list:

    if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {

   The third condition is incorrect. In some cases, especially for small fetchlists, the first partition may be empty while other partitions contain URLs. The Generator then wrongly assumes that all partitions are empty after looking only at the first one. The problem can also occur when all URLs in the fetchlist come from the same host (or from a very small number of hosts, or from hosts that all map to a small number of partitions).
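
   The difference between checking only the first partition and checking all of them can be sketched as follows. This is a minimal illustration, not the attached patch: plain Java iterators stand in for the `readers` array in the condition above, and the class and method names (EmptyFetchlistCheck, isEmptyFirstOnly, isEmptyAllPartitions, samplePartitions) are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class EmptyFetchlistCheck {

  // Mirrors the reported condition: only the first partition is inspected.
  public static boolean isEmptyFirstOnly(List<Iterator<String>> readers) {
    return readers == null || readers.isEmpty() || !readers.get(0).hasNext();
  }

  // Assumed shape of a fix: the fetchlist is empty only if every partition is empty.
  public static boolean isEmptyAllPartitions(List<Iterator<String>> readers) {
    if (readers == null || readers.isEmpty()) {
      return true;
    }
    for (Iterator<String> reader : readers) {
      if (reader.hasNext()) {
        return false; // at least one partition holds a URL
      }
    }
    return true;
  }

  // Two partitions: the first is empty, the second holds one URL.
  public static List<Iterator<String>> samplePartitions() {
    return Arrays.asList(
        Collections.<String>emptyList().iterator(),
        Arrays.asList("http://lucene.apache.org/").iterator());
  }

  public static void main(String[] args) {
    System.out.println(isEmptyFirstOnly(samplePartitions()));     // prints "true"
    System.out.println(isEmptyAllPartitions(samplePartitions())); // prints "false"
  }
}
```

   With an empty first partition and a non-empty second one, the first-partition check reports an empty fetchlist while the all-partitions check does not, which is the behaviour described in this report.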


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal Shah updated NUTCH-503:
------------------------------

    Attachment: emptyfetchlist.patch

I've created a patch to fix this issue. Please review, and commit it to trunk if it's ok.


[jira] Updated: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal Shah updated NUTCH-503:
------------------------------

    Attachment: emptyfetchlist.patch

Hi,

   The previous patch is missing a header line. I've reattached the patch.


[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506922 ]

Emmanuel Joke commented on NUTCH-503:
-------------------------------------

I just tried your patch, and I'm afraid I still have the same issue.

Actually, I noticed something odd:
- I did a first crawl with only one URL (http://www.boursorama.com/), and it didn't work. I got "Generator: 0 records selected for fetching, exiting".
- I did a second crawl, also with only one URL (http://lucene.apache.org/), and it worked perfectly.
- I did a last test crawling both URLs, and I got results for both sites.

It looks weird.


[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507144 ]

Vishal Shah commented on NUTCH-503:
-----------------------------------

Hi Emmanuel,

   Can you please dump the contents of your crawldb with the readdb command after injecting your URLs? Were these URLs injected into the db in the first place? It could be that your urlfilters are filtering out your URLs, or there may be some other problem (especially since your third test works). It would be good to know the contents of the crawldb after inject and before generate in each case.



[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507169 ]

Doğacan Güney commented on NUTCH-503:
-------------------------------------

Also, how many machines are there on your cluster and which version of nutch are you using?


[jira] Issue Comment Edited: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507169 ]

Doğacan Güney edited comment on NUTCH-503 at 6/22/07 1:58 AM:
--------------------------------------------------------------

Also, how many machines are there in your cluster, how many partitions does the generator try to create, and which version of Nutch are you using?


 was:
Also, how many machines are there on your cluster and which version of nutch are you using?


[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507469 ]

Emmanuel Joke commented on NUTCH-503:
-------------------------------------

Sorry, my mistake.

My compiled jar was not correctly included in my classpath. I can confirm that it works with your patch.

Thanks for your help.


[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507535 ]

Doğacan Güney commented on NUTCH-503:
-------------------------------------

Nice to hear, Emmanuel.

I believe this is ready for committing, but, Vishal, can you add a test case for this? (Though I am not sure how we can add a test case, since this bug only occurs in distributed setups.)


[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509039 ]

Emmanuel Joke commented on NUTCH-503:
-------------------------------------

The results seem good, so I'm wondering whether this patch can be committed?


[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509056 ]

Vishal Shah commented on NUTCH-503:
-----------------------------------

Hi Dogacan,

    I don't know how to write a test case to cover this particular bug. Any thoughts?


[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12509059 ]

Doğacan Güney commented on NUTCH-503:
-------------------------------------

>  I don't know how to write a test case to cover this particular bug. Any thoughts?

Normally, you would update TestGenerator by generating a couple of URLs and then showing that the first part is empty even though other parts contain URLs (so Nutch would fail this test case without your patch).

However, this bug only occurs in a distributed setup, while our test cases run in a single-machine setup (by default). Hadoop does have something called MiniMRCluster, which (I think) allows you to run distributed tests. This class comes from Hadoop's test jar, which we don't have.

Since your patch is (hopefully :) obviously correct, we can skip writing a unit test for this one. But we really need some sort of mechanism to run our tests in a distributed setup.
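
The situation such a test would need to set up (URLs present, but not necessarily in the first part) can be imitated outside a cluster with a rough sketch like the one below. It assumes URLs are assigned to parts by hashing their host, which is a simplification for illustration and not a statement about the actual partitioner; the class and method names (HostPartitionDemo, partitionFor, partition, countNonEmpty) are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HostPartitionDemo {

  // Hypothetical partitioner: bucket a URL by the hash of its host name.
  public static int partitionFor(String url, int numPartitions) {
    String host = URI.create(url).getHost();
    return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }

  // Distribute the URLs over numPartitions lists.
  public static List<List<String>> partition(List<String> urls, int numPartitions) {
    List<List<String>> parts = new ArrayList<List<String>>();
    for (int i = 0; i < numPartitions; i++) {
      parts.add(new ArrayList<String>());
    }
    for (String url : urls) {
      parts.get(partitionFor(url, numPartitions)).add(url);
    }
    return parts;
  }

  // Count how many partitions received at least one URL.
  public static int countNonEmpty(List<List<String>> parts) {
    int count = 0;
    for (List<String> part : parts) {
      if (!part.isEmpty()) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList(
        "http://www.boursorama.com/",
        "http://www.boursorama.com/a",
        "http://www.boursorama.com/b");
    List<List<String>> parts = partition(urls, 4);
    System.out.println("non-empty parts: " + countNonEmpty(parts));
  }
}
```

Under this assumption, all URLs of a single host land in exactly one of the four parts. Whether that part is the first one depends on the hash, so a check that reads only the first part can see an empty list even though URLs were selected.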


[jira] Resolved: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

     [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney resolved NUTCH-503.
---------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: 0.8.2)
                   1.0.0
         Assignee: Doğacan Güney

Committed in rev. 554539 with style changes.

I skipped the unit test part. But we should consider bringing in the Hadoop test jar in the future so that we can run test jobs in a distributed environment.

PS: Vishal, for future reference, nutch uses 2-space indents.


spam detect

Anton Potekhin
Hello!

Does Nutch have any modules for spam detection?
Does anyone know where I can find any information (blogs, articles, FAQ)
about it?


[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511330 ]

Hudson commented on NUTCH-503:
------------------------------

Integrated in Nutch-Nightly #145 (See [http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/145/])


[jira] Commented: (NUTCH-503) Generator exits incorrectly for small fetchlists

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)

    [ https://issues.apache.org/jira/browse/NUTCH-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529539 ]

The Jin Group commented on NUTCH-503:
-------------------------------------

I just applied this patch in my test environment. I have a question: why does bin/nutch set up the classpath to point to the build directory?
I think there is a mistake in the classpath setup, because it points to a *.job file, which is not a standard Java file.


Otherwise, the patch is OK.

regards

Jin
