[jira] Created: (NUTCH-136) mapreduce segment generator generates 50 % less than excepted urls


Steve Loughran (Jira)
mapreduce segment generator generates  50 % less  than excepted urls
--------------------------------------------------------------------

         Key: NUTCH-136
         URL: http://issues.apache.org/jira/browse/NUTCH-136
     Project: Nutch
        Type: Bug
    Versions: 0.8-dev    
    Reporter: Stefan Groschupf
    Priority: Critical


We noticed that segments generated with the MapReduce segment generator contain only 50% of the expected URLs. We had a crawldb with 40,000 URLs, and the generate command created a segment of only 20,000 pages. This also happened with the topN parameter: every time, we got around 50% of the expected URLs.

I tested PartitionUrlByHost and it looks like it does its job. However, we fixed the problem by changing two things:
First, we set the partitioner to a normal hashPartitioner.
Second, we changed Generator.java line 48 from:
limit = job.getLong("crawl.topN",Long.MAX_VALUE)/job.getNumReduceTasks();
to:
limit = job.getLong("crawl.topN",Long.MAX_VALUE);

Now it works as expected.
Does anyone have an idea what the real source of this problem could be?
In general, this bug has the effect that all MapReduce users fetch only 50% of their URLs per iteration.
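
As a rough illustration of why the division on line 48 matters (a sketch with hypothetical class and method names, not the Nutch source): if each reduce task caps its output at topN divided by the number of reduce tasks, the caps only add up to topN when every reduce task receives at least that many URLs. The per-reducer counts below are made-up numbers, and this is not claimed to be the confirmed cause of the 50% loss.

```java
import java.util.Arrays;

public class TopNLimitSketch {
    /** URLs emitted when each reduce task caps its own output at topN / numReduceTasks. */
    public static long generated(long topN, long[] urlsPerReducer) {
        long limit = topN / urlsPerReducer.length; // mirrors the original line 48
        return Arrays.stream(urlsPerReducer)
                .map(n -> Math.min(n, limit))      // a reducer cannot emit more than it receives
                .sum();
    }

    public static void main(String[] args) {
        // Evenly spread input: the per-reducer caps add up to topN.
        System.out.println(generated(40_000, new long[] {10_000, 10_000, 10_000, 10_000})); // 40000
        // Skewed input (assumed numbers): three reducers never reach their cap.
        System.out.println(generated(40_000, new long[] {25_000, 5_000, 5_000, 5_000}));    // 25000
    }
}
```

Removing the division, as in the change above, makes each reduce task's cap equal to topN, so an uneven spread no longer shrinks the total, although the total may then exceed topN.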



--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50 % less than excepted urls

Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363194 ]

Dominik Friedrich commented on NUTCH-136:
-----------------------------------------

It took me some hours, but I finally solved the mystery. The problem is this line
177      numLists = job.getNumMapTasks();            // a partition per fetch task
in combination with this
211    job.setNumReduceTasks(numLists);
and the fact that nutch-site.xml overrides job.xml settings.

In my case, on the box that runs the jobtracker and from which I start jobs, nutch-site.xml defines map.tasks=12 and reduce.tasks=4. On the other three boxes, nutch-site.xml contains no map.tasks or reduce.tasks setting. When the second job of the generator tool is started, the jobtracker creates only 4 reduce tasks because reduce.tasks=4 in nutch-site.xml overrides the job.xml on this box. But the map tasks on the other 3 boxes read 12 reduce tasks from the job.xml, so they create 12 partitions. When the 4 reduce tasks are started, they read only the data from partitions 0-3 on those 3 boxes, so 3*8 partitions get lost.

I solved this problem by removing line 211.
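
The mismatch described above can be sketched as a small simulation. This is only an illustration: the class and method names are hypothetical, the modulo stands in for whatever partitioner the job really uses, and the record count is arbitrary.

```java
import java.util.stream.IntStream;

public class PartitionMismatchSketch {
    /**
     * Counts records that land in a partition which some reduce task actually reads.
     * A map task writes partitions 0..partitionsWritten-1, but only partitions
     * 0..reduceTasksStarted-1 are ever fetched.
     */
    public static long recordsRead(int records, int partitionsWritten, int reduceTasksStarted) {
        return IntStream.range(0, records)
                .map(key -> key % partitionsWritten)              // stand-in for a hash partitioner
                .filter(partition -> partition < reduceTasksStarted)
                .count();
    }

    public static void main(String[] args) {
        // Map side believes there are 12 reduce tasks, but only 4 are started.
        System.out.println(recordsRead(12_000, 12, 4)); // 4000
        // Both sides agree on 4 reduce tasks: nothing is lost.
        System.out.println(recordsRead(12_000, 4, 4));  // 12000
    }
}
```

In this simplified model, a map task that writes 12 partitions while only 4 are read loses two thirds of its output, which matches the "8 lost partitions per box" count above.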

[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50 % less than excepted urls

Steve Loughran (Jira)
In reply to this post by Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363198 ]

Dominik Friedrich commented on NUTCH-136:
-----------------------------------------

I think the correct solution would be to move all mapred settings from nutch-site.xml into mapred-default.xml, which is read before the job.xml files.
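
A minimal sketch of the override order this relies on, assuming the load order described in this thread (mapred-default.xml first, then job.xml, then nutch-site.xml, with later files winning). The class and method names are hypothetical and this is not the Nutch configuration code; the values 4 and 12 are taken from the example in the previous comment.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class ConfigLayeringSketch {
    /** Resolves a key by applying the layers in load order; later layers win. */
    public static String resolve(String key, List<Map<String, String>> layersInLoadOrder) {
        String value = null;
        for (Map<String, String> layer : layersInLoadOrder) {
            if (layer.containsKey(key)) {
                value = layer.get(key);
            }
        }
        return value;
    }

    public static void main(String[] args) {
        String key = "mapred.reduce.tasks";
        Map<String, String> localSetting = Collections.singletonMap(key, "4");  // per-box setting
        Map<String, String> jobXml = Collections.singletonMap(key, "12");       // set by the job

        // Per-box setting kept in nutch-site.xml: it is read after job.xml and overrides the job.
        System.out.println(resolve(key, Arrays.asList(jobXml, localSetting)));  // 4
        // Same setting moved to mapred-default.xml: job.xml is read later, so the job's value wins.
        System.out.println(resolve(key, Arrays.asList(localSetting, jobXml)));  // 12
    }
}
```

With the setting in mapred-default.xml, every box ends up with the value from job.xml, so the map and reduce sides agree on the number of partitions.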

[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50 % less than excepted urls

Steve Loughran (Jira)
In reply to this post by Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363308 ]

Doug Cutting commented on NUTCH-136:
------------------------------------

The mapred-default.xml file is actually the best place to set these.

[jira] Updated: (NUTCH-136) mapreduce segment generator generates 50 % less than excepted urls

Steve Loughran (Jira)
In reply to this post by Steve Loughran (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-136?page=all ]

Doug Cutting updated NUTCH-136:
-------------------------------

    Comment: was deleted

[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50 % less than excepted urls

Steve Loughran (Jira)
In reply to this post by Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363587 ]

Mike Smith commented on NUTCH-136:
----------------------------------

I have had the same problem. Florent suggested using "protocol-http" instead of "protocol-httpclient"; this fixed the problem on a single machine, but I still have the same problem when I have multiple datanodes using NDFS. Commenting out line 211 didn't help. Here are my results:

Injected URLs: 80000
Only one machine is a datanode: 70000 fetched pages
map tasks: 3
reduce tasks: 3
threads: 250

Injected URLs: 80000
3 machines are datanodes. Judging by the task tracker logs on the three machines, all of them participated in the fetching: 20000 fetched pages
map tasks: 12
reduce tasks: 6
threads: 250

Injected URLs: 5000
3 machines are datanodes. Judging by the task tracker logs on the three machines, all of them participated in the fetching: 1200 fetched pages
map tasks: 12
reduce tasks: 6
threads: 250

Injected URLs: 1000
3 machines are datanodes. Judging by the task tracker logs on the three machines, all of them participated in the fetching: 240 fetched pages

Injected URLs: 1000
Only one machine is a datanode: 800 fetched pages
map tasks: 3
reduce tasks: 3
threads: 250

Thanks, Mike

[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50 % less than excepted urls

Steve Loughran (Jira)
In reply to this post by Steve Loughran (Jira)
    [ http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363886 ]

Florent Gluck commented on NUTCH-136:
-------------------------------------

On my setup of 5 boxes (4 slaves, 1 master), I confirm that what Dominik Friedrich suggested fixes the missing urls I've been encountering for a while.
I simply moved the following properties from nutch-site.xml to mapred-default.xml:

<property>
  <name>mapred.map.tasks</name>
  <value>100</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>40</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

After injecting 100'000 urls and doing a single-pass crawl, I grepped the logs on my 4 slaves and confirmed that the fetching attempts add up to exactly 100'000.  Therefore, there is no need to modify Generator.java.
I also ran some tests with protocol-http and protocol-httpclient and verified that they give similar results.  No urls were missing in either case.

--Florent


[jira] Closed: (NUTCH-136) mapreduce segment generator generates 50 % less than excepted urls

Steve Loughran (Jira)
In reply to this post by Steve Loughran (Jira)
     [ http://issues.apache.org/jira/browse/NUTCH-136?page=all ]
     
Andrzej Bialecki  closed NUTCH-136:
-----------------------------------

    Resolution: Duplicate

Thank you for investigating this. I'm closing this issue; further discussion should continue in NUTCH-186.
