[jira] Created: (NUTCH-361) generator create fetchlist randomly

[jira] Created: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
generator create fetchlist randomly
-----------------------------------

                 Key: NUTCH-361
                 URL: http://issues.apache.org/jira/browse/NUTCH-361
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0
         Environment: Java 1.5, FreeBSD 6.1
            Reporter: Uros Gruber
            Priority: Critical


I noticed problems while generating the fetchlist. I already posted some info to the users list. Today I checked release 0.8, and I'm certain the problem exists only in versions later than that. I have tested only 0.8 and today's SVN checkout.

The problem is that the generator builds the fetchlist from the crawldb, but every time I run it the fetchlist contains a different number of URLs.
For example, I injected the 6 test URLs we use for testing, and in only 5 of 20 runs were all URLs listed in the fetchlist; sometimes only one was. The config was always the same, including when testing version 0.8.

I tried to debug what might be going wrong, but all I found was that all the URLs were present in /tmp yet somehow missing from crawl_generate.

I also see some of these:
2006-09-02 20:14:20,147 DEBUG conf.Configuration - java.io.IOException: config(config)
        at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:76)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:87)
        at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:98)
        at org.apache.nutch.util.NutchJob.<init>(NutchJob.java:26)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:330)
        at org.apache.nutch.crawl.Generator.run(Generator.java:405)
        at org.apache.nutch.util.ToolBase.doMain(ToolBase.java:145)
        at org.apache.nutch.crawl.Generator.main(Generator.java:372)

when I enable DEBUG logging, but I doubt this has anything to do with the problem.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432322 ]
           
Sami Siren commented on NUTCH-361:
----------------------------------

I have started to write some simple JUnit tests for the main tools (inject, generate, fetch); some are already on SVN trunk. If you can extend some of those to demonstrate this problem, it will be easier to track down.

Re: [jira] Commented: (NUTCH-361) generator create fetchlist randomly

Uroš Gruber-2
Sami Siren (JIRA) wrote:
>     [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432322 ]
>            
> Sami Siren commented on NUTCH-361:
> ----------------------------------
>
> I have started to write some simple JUnit tests for the main tools (inject, generate, fetch); some are already on SVN trunk. If you can extend some of those to demonstrate this problem, it will be easier to track down.
>
>  
I ran through them and here is where my problem popped up:

   [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 4.294 sec
   [junit] Test org.apache.nutch.crawl.TestGenerator FAILED

I ran this on the server. I also have problems running the tests from Eclipse:
java.lang.ArithmeticException: / by zero
    at org.apache.nutch.crawl.PartitionUrlByHost.getPartition(PartitionUrlByHost.java:49)
    at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:152)
    at org.apache.nutch.crawl.Generator$SelectorInverseMapper.map(Generator.java:223)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:51)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:195)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:106)

Probably some configuration problem.

regards

Uros

[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432328 ]
           
Uros Gruber commented on NUTCH-361:
-----------------------------------

I ran through them and here is where my problem popped up:
  [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 4.294 sec
  [junit] Test org.apache.nutch.crawl.TestGenerator FAILED


I ran this on the server. The only thing I have in nutch-site.xml is http.agent.name;
everything else is default.

I have some problems running the unit tests from Eclipse:
java.lang.ArithmeticException: / by zero
   at org.apache.nutch.crawl.PartitionUrlByHost.getPartition(PartitionUrlByHost.java:49)
   at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:152)
   at org.apache.nutch.crawl.Generator$SelectorInverseMapper.map(Generator.java:223)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:51)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:195)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:106)

Probably some configuration problem that I need to figure out.

[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432362 ]
           
Uros Gruber commented on NUTCH-361:
-----------------------------------

I think I found the problem; at least there are no unit test failures when I use Hadoop 0.4 instead of 0.5. I still need to test with the latest Hadoop trunk.

[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432364 ]
           
Uros Gruber commented on NUTCH-361:
-----------------------------------

Here is the result with the latest Hadoop 0.5.1-dev:

    [junit] Running org.apache.nutch.crawl.TestInjector
    [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 2.036 sec
    [junit] Test org.apache.nutch.crawl.TestInjector FAILED
    [junit] Running org.apache.nutch.crawl.TestLinkDbMerger
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 2.104 sec
    [junit] Running org.apache.nutch.crawl.TestMapWritable
    [junit] Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 5.592 sec
    [junit] Running org.apache.nutch.crawl.TestSignatureFactory
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.653 sec
    [junit] Running org.apache.nutch.fetcher.TestFetcher
    [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 2.151 sec
    [junit] Test org.apache.nutch.fetcher.TestFetcher FAILED

It looks like my problem with generating the fetchlist is gone, but two new JUnit failures have popped up.

[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432441 ]
           
Uros Gruber commented on NUTCH-361:
-----------------------------------

The problem with 0.5.1-dev is actually that the UTF8 class is being replaced with Text.

[junit] java.lang.ClassCastException: org.apache.hadoop.io.Text

I found the same problem in HADOOP-460.

I tried to create a patch to replace this, but I think we need to talk about it first, because CrawlDatum is stored in all dbs.
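
For readers unfamiliar with this kind of failure, here is a minimal sketch of how a ClassCastException like the one above can arise when a reader written against one key class is handed keys of a newer class. OldKey, NewKey, readUrl and readUrlTolerant are hypothetical stand-ins invented for illustration; they are not the Hadoop UTF8 or Text classes and not Nutch code.

```java
// Hypothetical illustration only: OldKey/NewKey stand in for "old" and "new"
// key classes. None of these names come from Hadoop or Nutch.
final class KeyTypeMismatchSketch {

    interface Key {}

    static final class OldKey implements Key {
        final String value;
        OldKey(String value) { this.value = value; }
    }

    static final class NewKey implements Key {
        final String value;
        NewKey(String value) { this.value = value; }
    }

    /** Reader written against the old key class; fails on a NewKey. */
    static String readUrl(Key key) {
        // Throws ClassCastException if the key is actually a NewKey.
        return ((OldKey) key).value;
    }

    /** Reader that accepts either key class during a migration. */
    static String readUrlTolerant(Key key) {
        if (key instanceof OldKey) {
            return ((OldKey) key).value;
        }
        if (key instanceof NewKey) {
            return ((NewKey) key).value;
        }
        throw new IllegalArgumentException("unknown key type: " + key.getClass());
    }

    public static void main(String[] args) {
        Key key = new NewKey("http://example.org/");
        System.out.println(readUrlTolerant(key));
        try {
            readUrl(key);
        } catch (ClassCastException e) {
            System.out.println("old reader failed: " + e);
        }
    }
}
```

Whether a tolerant reader or a one-time conversion of stored data is the better route is exactly the kind of question the comment above says needs discussion.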


limitation

Anton Potekhin
How can I limit the number of pages processed from each domain? And how do I set up
Nutch to crawl only the domains I added (i.e. make Nutch ignore external
links)? If Nutch doesn't allow this, which algorithm would be best for
it?


P.S. Nutch version 0.7
 


[jira] Updated: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-361?page=all ]

Uros Gruber updated NUTCH-361:
------------------------------

    Attachment: partition.diff

Patch to check the number of reduce tasks and set it to 1 if it is set to 0.
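
The attached partition.diff is not reproduced in this thread. As a rough sketch of what a guard of this kind might look like, assuming the partition is computed as a hash of the host modulo the reducer count (an assumption; the class and method names below are hypothetical and are not the Nutch PartitionUrlByHost source):

```java
// Hypothetical stand-in for a host-based partitioner; not the Nutch source.
final class HostPartitionSketch {

    /** Unguarded variant: throws ArithmeticException when numReduceTasks is 0. */
    static int unguarded(String host, int numReduceTasks) {
        // Mask the sign bit so the result of % is never negative.
        return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Guarded variant: treats a non-positive reducer count as a single reducer. */
    static int guarded(String host, int numReduceTasks) {
        int n = numReduceTasks > 0 ? numReduceTasks : 1;
        return (host.hashCode() & Integer.MAX_VALUE) % n;
    }

    public static void main(String[] args) {
        // With a single partition every host maps to partition 0.
        System.out.println(guarded("example.org", 0));
        try {
            unguarded("example.org", 0);
        } catch (ArithmeticException e) {
            System.out.println("unguarded failed: " + e.getMessage());
        }
    }
}
```

A guard like this only avoids the exception inside the partitioner; it does not explain why the reducer count arrives as zero in the first place.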

[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432858 ]
           
Uros Gruber commented on NUTCH-361:
-----------------------------------

Here are my latest findings. While debugging TestGenerator I found that in org.apache.nutch.crawl.PartitionUrlByHost.getPartition, numReduceTasks becomes 0, which causes java.lang.ArithmeticException: / by zero. My patch does not solve the problem.

One thing I can't understand: the first thread goes through all 100 URLs, and in the log I see "map 100% reduce 0%". After that a new thread is started from 0%, but this thread runs with numReduceTasks set to zero.

All this was tested with the released Hadoop 0.5.0, which is also the version bundled with SVN Nutch.

I don't know if I'm the only one seeing this bug, because no one has replied.

[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432861 ]
           
Sami Siren commented on NUTCH-361:
----------------------------------

The nightly builds are broken because of this problem. I scratched my head for a long time because my local source was working perfectly; then I noticed that I had set the following property in my hadoop-site.xml:


<property>
  <name>mapred.map.tasks</name>
  <value>1</value>
  <description>
    define mapred.map tasks to be number of slave hosts
  </description>
</property>

I need to dig further; possibly the test case is broken.

[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432864 ]
           
Sami Siren commented on NUTCH-361:
----------------------------------

Oops, I pasted the wrong property. This is the one I meant:

<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
  <description>
    define mapred.reduce tasks to be number of slave hosts
  </description>
</property>

[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432868 ]
           
Uros Gruber commented on NUTCH-361:
-----------------------------------

I also played around with this property, but I noticed this:

The default number of map tasks per job. Typically set to a prime several times greater than number of available hosts. Ignored when mapred.job.tracker is "local".

The funny thing is that with the latest Hadoop from SVN, testGenerator works.

Also, no matter where I put this property, the test failed with an ArithmeticException.


[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432869 ]
           
Uros Gruber commented on NUTCH-361:
-----------------------------------

I tested again, and with this property the test passed.
I'll try my URL list to see whether this was the case there as well.


[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432873 ]
           
Uros Gruber commented on NUTCH-361:
-----------------------------------

With mapred.reduce.tasks set to 1 in hadoop-site.xml, all problems just disappear. Do you know why this could suddenly be a problem? With Hadoop 0.4.0, things work without this.
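
For reference, a minimal sketch of what the workaround described above might look like in hadoop-site.xml. The property name and value come from the comment; the description text is illustrative wording, not copied from Hadoop.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
    <!-- Illustrative description only: workaround discussed in NUTCH-361 -->
    <description>Run a single reduce task per job.</description>
  </property>
</configuration>
```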


[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432877 ]
           
Sami Siren commented on NUTCH-361:
----------------------------------

I have not tracked Hadoop development that closely, so I really have no idea about all the changes from 0.4.x to 0.5.x.

Stranger still, 1 is the default value for it, and I cannot see any code that tries to modify it.


[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432899 ]
           
Uros Gruber commented on NUTCH-361:
-----------------------------------

Exactly my point. When debugging with the Hadoop source, I found some sort of numTasks++ and numTasks--, and also that in the local version it is initially set to 0. That is strange, but a few lines later it gets raised by one and later lowered by one. So maybe there is some problem in Hadoop, but only in 0.5.0.

Do you think this bug can be resolved, or do we need to add some documentation about this property?

For me the test passed, and the production server also works with that.


[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12432902 ]
           
Andrzej Bialecki  commented on NUTCH-361:
-----------------------------------------

Please create a new issue in the Hadoop JIRA and copy the description. Most folks from the Hadoop team are not involved in Nutch development, so they are probably unaware that something is wrong.


[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12433169 ]
           
Sami Siren commented on NUTCH-361:
----------------------------------

The / by 0 was due to a bug in the testcase. Now the testcase fails about 50% of the time. I also noticed that the number of reduce tasks is set to 2 in the generator (I was looking from too far away).
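
A minimal, self-contained Java sketch of the two effects mentioned in this thread, under stated assumptions. It is not the Nutch or Hadoop code: `PartitionSketch`, `partition`, and `split` are hypothetical names, and it assumes that the reduce-task count is used as the divisor of a hash partitioner. With a count of 0 the integer remainder throws an ArithmeticException, and with a count of 2 the URLs are spread over two outputs, so a consumer that looked at only one of them would see only some of the URLs. Whether this is what actually happens in the generator is not established here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of hash partitioning; not the Nutch/Hadoop implementation. */
class PartitionSketch {

    /** Maps a key to one of numPartitions buckets using a non-negative hash. */
    public static int partition(String key, int numPartitions) {
        // An integer remainder by zero throws ArithmeticException.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Splits the keys into numPartitions lists, one per (assumed) reduce task. */
    public static List<List<String>> split(List<String> keys, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String key : keys) {
            parts.get(partition(key, numPartitions)).add(key);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Made-up test URLs standing in for the six URLs of the report.
        List<String> urls = Arrays.asList(
            "http://a.example/", "http://b.example/", "http://c.example/",
            "http://d.example/", "http://e.example/", "http://f.example/");

        // Two partitions: every URL lands in exactly one of the two lists.
        List<List<String>> parts = split(urls, 2);
        System.out.println("partition 0: " + parts.get(0).size() + " urls");
        System.out.println("partition 1: " + parts.get(1).size() + " urls");

        // Zero partitions: the remainder operation fails.
        try {
            partition(urls.get(0), 0);
        } catch (ArithmeticException e) {
            System.out.println("ArithmeticException: " + e.getMessage());
        }
    }
}
```

With a single partition, `split(urls, 1)` returns one list holding all six URLs, which matches the observation that setting mapred.reduce.tasks to 1 made the missing URLs reappear.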






[jira] Commented: (NUTCH-361) generator create fetchlist randomly

Sebastian Nagel (Jira)
In reply to this post by Sebastian Nagel (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-361?page=comments#action_12438730 ]
           
Uros Gruber commented on NUTCH-361:
-----------------------------------

Hi, I'm back from vacation, and while checking what is going on with this I found that there was no response on HADOOP-511. Any news about this here?
